From: "Huang, Ying"
To: "zhangpeng (AS)"
Cc: Yin Fengwei
Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault()
In-Reply-To: <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com> (zhangpeng's message of "Thu, 23 Nov 2023 15:57:44 +0800")
References: <20231122140052.4092083-1-zhangpeng362@huawei.com> <801bd0c9-7d0c-4231-93e5-7532e8231756@intel.com> <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com>
Date: Fri, 24 Nov 2023 12:13:56 +0800
Message-ID: <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com>

"zhangpeng (AS)" writes:

> On 2023/11/23 13:26, Yin Fengwei wrote:
>
>> On 11/23/23 12:12, zhangpeng (AS) wrote:
>>> On 2023/11/23 9:09, Yin Fengwei wrote:
>>>
>>>> Hi Peng,
>>>>
>>>> On 11/22/23 22:00, Peng Zhang wrote:
>>>>> From: ZhangPeng
>>>>>
>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>> in application, which leading to an unexpected performance issue[1].
>>>>>
>>>>> This caused by temporarily cleared pte during a read/modify/write update
>>>>> of the pte, eg, do_numa_page()/change_pte_range().
>>>>>
>>>>> For the data segment of the user-mode program, the global variable area
>>>>> is a private mapping. After the pagecache is loaded, the private anonymous
>>>>> page is generated after the COW is triggered. Mlockall can lock COW pages
>>>>> (anonymous pages), but the original file pages cannot be locked and may
>>>>> be reclaimed. If the global variable (private anon page) is accessed when
>>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered.
>>>>>
>>>>> At this time, the original private file page may have been reclaimed.
>>>>> If the page cache is not available at this time, a major fault will be
>>>>> triggered and the file will be read, causing additional overhead.
>>>>>
>>>>> Fix this by rechecking the pte by holding ptl in filemap_fault() before
>>>>> triggering a major fault.
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>
>>>>> Signed-off-by: ZhangPeng
>>>>> Signed-off-by: Kefeng Wang
>>>>> ---
>>>>>   mm/filemap.c | 14 ++++++++++++++
>>>>>   1 file changed, 14 insertions(+)
>>>>>
>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>> index 71f00539ac00..bb5e6a2790dc 100644
>>>>> --- a/mm/filemap.c
>>>>> +++ b/mm/filemap.c
>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>               mapping_locked = true;
>>>>>           }
>>>>>       } else {
>>>>> +        pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>> +                          vmf->address, &vmf->ptl);
>>>>> +        if (ptep) {
>>>>> +            /*
>>>>> +             * Recheck pte with ptl locked as the pte can be cleared
>>>>> +             * temporarily during a read/modify/write update.
>>>>> +             */
>>>>> +            if (unlikely(!pte_none(ptep_get(ptep))))
>>>>> +                ret = VM_FAULT_NOPAGE;
>>>>> +            pte_unmap_unlock(ptep, vmf->ptl);
>>>>> +            if (unlikely(ret))
>>>>> +                return ret;
>>>>> +        }
>>>> I am curious. Did you try not to take PTL here and just check whether PTE is not NONE?
>>> Thank you for your reply.
>>>
>>> If we don't take PTL, the current use case won't trigger this issue either.
>> Is this verified by testing or just in theory?
>
> If we add a delay between ptep_modify_prot_start() and ptep_modify_prot_commit(),
> this issue will also trigger. Without delay, we haven't reproduced this problem
> so far.
>
>>> In most cases, if we don't take PTL, this issue won't be triggered. However,
>>> there is still a possibility of triggering this issue. The corner case is that
>>> task 2 triggers a page fault when task 1 is between ptep_modify_prot_start()
>>> and ptep_modify_prot_commit() in do_numa_page(). Furthermore,task 2 passes the
>>> check whether the PTE is not NONE before task 1 updates PTE in
>>> ptep_modify_prot_commit() without taking PTL.
>> There is very limited operations between ptep_modify_prot_start() and
>> ptep_modify_prot_commit(). While the code path from page fault to this check is
>> long. My understanding is it's very likely the PTE is not NONE when do PTE check
>> here without hold PTL (This is my theory. :)).
>
> Yes, there is a high probability that this issue won't occur without taking PTL.
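
To make the window being discussed concrete, here is a rough user-space sketch
of it, not kernel code. Everything in it is a hypothetical stand-in: `pte` is a
plain atomic word, `ptl` is a pthread mutex, and the `cleared`/`observed` flags
only exist to force the interleaving (much like the artificial delay mentioned
above). It assumes the updater holds the lock across the clear and the restore,
as do_numa_page() does around ptep_modify_prot_start() and
ptep_modify_prot_commit().

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins: one "PTE" word (0 means none) and its "PTL". */
static _Atomic uint64_t pte = 0x1234;
static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool cleared = false;   /* updater is inside its window */
static atomic_bool observed = false;  /* faulting side has sampled the PTE */

/* Models the read/modify/write update: clear, keep the window open, restore. */
static void *updater(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&ptl);
    uint64_t old = atomic_exchange(&pte, 0);  /* start: PTE temporarily none */
    atomic_store(&cleared, true);
    while (!atomic_load(&observed))           /* forced delay, for the demo only */
        ;
    atomic_store(&pte, old);                  /* commit: PTE valid again */
    pthread_mutex_unlock(&ptl);
    return NULL;
}

struct result {
    bool lockless_saw_none;  /* check done without taking the lock */
    bool locked_saw_none;    /* recheck done with the lock held */
};

/* Models the faulting task: one lockless sample, then a locked recheck. */
static struct result run_model(void)
{
    struct result r;
    pthread_t t;

    atomic_store(&pte, 0x1234);
    atomic_store(&cleared, false);
    atomic_store(&observed, false);

    pthread_create(&t, NULL, updater, NULL);
    while (!atomic_load(&cleared))            /* wait until the window is open */
        ;
    r.lockless_saw_none = (atomic_load(&pte) == 0);
    atomic_store(&observed, true);

    pthread_mutex_lock(&ptl);                 /* blocks until the updater commits */
    r.locked_saw_none = (atomic_load(&pte) == 0);
    pthread_mutex_unlock(&ptl);

    pthread_join(t, NULL);
    return r;
}
```

In this model the lockless sample lands inside the window and sees "none",
while the locked recheck can only run after the restore. A real lockless check
would usually land outside the window, which is the probability argument made
above.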
>
>> In the other side, acquiring/releasing PTL may bring performance impaction. It may
>> not be big deal because the IO operations in this code path. But it's better to
>> collect some performance data IMHO.
>
> We tested the performance of file private mapping page fault (page_fault2.c of
> will-it-scale [1]) and file shared mapping page fault (page_fault3.c of will-it-scale).
> The difference in performance (in operations per second) before and after patch
> applied is about 0.7% on a x86 physical machine.

Is that an improvement or a reduction?

--
Best Regards,
Huang, Ying

> [1] https://github.com/antonblanchard/will-it-scale/tree/master
>
>>
>> Regards
>> Yin, Fengwei
>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>> +
>>>>>           /* No page in the page cache at all */
>>>>>           count_vm_event(PGMAJFAULT);
>>>>>           count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
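
One way to report the will-it-scale numbers unambiguously is as a signed
relative change, so the direction is part of the figure. A minimal sketch
follows. The helper names are made up for illustration, and no measured values
are implied. It assumes the metric is operations per second, where higher is
better.

```c
#include <stdbool.h>

/* Signed relative change in percent: positive means `after` is higher. */
static double percent_change(double before, double after)
{
    return (after - before) / before * 100.0;
}

/* For a throughput metric (operations per second), higher is better. */
static bool is_improvement(double before, double after)
{
    return percent_change(before, after) > 0.0;
}
```

With made-up inputs, percent_change(1000.0, 993.0) is roughly -0.7, which
would be a reduction, while percent_change(1000.0, 1007.0) is roughly +0.7.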