From: "Huang, Ying"
To: "zhangpeng (AS)"
Cc: Yin Fengwei
Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault()
In-Reply-To: <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com> (Ying Huang's message of "Fri, 24 Nov 2023 12:13:56 +0800")
References: <20231122140052.4092083-1-zhangpeng362@huawei.com> <801bd0c9-7d0c-4231-93e5-7532e8231756@intel.com> <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com> <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com>
Date: Fri, 24 Nov 2023 12:26:36 +0800
Message-ID: <87ttpb7p4z.fsf@yhuang6-desk2.ccr.corp.intel.com>

"Huang, Ying" writes:

> "zhangpeng (AS)" writes:
>
>> On 2023/11/23 13:26, Yin Fengwei wrote:
>>
>>> On 11/23/23 12:12, zhangpeng (AS) wrote:
>>>> On 2023/11/23 9:09, Yin Fengwei wrote:
>>>>
>>>>> Hi Peng,
>>>>>
>>>>> On 11/22/23 22:00, Peng Zhang wrote:
>>>>>> From: ZhangPeng
>>>>>>
>>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>> in application, which leading to an unexpected performance issue[1].
>>>>>>
>>>>>> This caused by temporarily cleared pte during a read/modify/write update
>>>>>> of the pte, eg, do_numa_page()/change_pte_range().
>>>>>>
>>>>>> For the data segment of the user-mode program, the global variable area
>>>>>> is a private mapping. After the pagecache is loaded, the private anonymous
>>>>>> page is generated after the COW is triggered. Mlockall can lock COW pages
>>>>>> (anonymous pages), but the original file pages cannot be locked and may
>>>>>> be reclaimed. If the global variable (private anon page) is accessed when
>>>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered.
>>>>>>
>>>>>> At this time, the original private file page may have been reclaimed.
>>>>>> If the page cache is not available at this time, a major fault will be
>>>>>> triggered and the file will be read, causing additional overhead.
>>>>>>
>>>>>> Fix this by rechecking the pte by holding ptl in filemap_fault() before
>>>>>> triggering a major fault.
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>>
>>>>>> Signed-off-by: ZhangPeng
>>>>>> Signed-off-by: Kefeng Wang
>>>>>> ---
>>>>>>   mm/filemap.c | 14 ++++++++++++++
>>>>>>   1 file changed, 14 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>>> index 71f00539ac00..bb5e6a2790dc 100644
>>>>>> --- a/mm/filemap.c
>>>>>> +++ b/mm/filemap.c
>>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>>               mapping_locked = true;
>>>>>>           }
>>>>>>       } else {
>>>>>> +        pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>>> +                          vmf->address, &vmf->ptl);
>>>>>> +        if (ptep) {
>>>>>> +            /*
>>>>>> +             * Recheck pte with ptl locked as the pte can be cleared
>>>>>> +             * temporarily during a read/modify/write update.
>>>>>> +             */
>>>>>> +            if (unlikely(!pte_none(ptep_get(ptep))))
>>>>>> +                ret = VM_FAULT_NOPAGE;
>>>>>> +            pte_unmap_unlock(ptep, vmf->ptl);
>>>>>> +            if (unlikely(ret))
>>>>>> +                return ret;
>>>>>> +        }
>>>>> I am curious. Did you try not to take PTL here and just check whether PTE is not NONE?
>>>> Thank you for your reply.
>>>>
>>>> If we don't take PTL, the current use case won't trigger this issue either.
>>> Is this verified by testing or just in theory?
>>
>> If we add a delay between ptep_modify_prot_start() and ptep_modify_prot_commit(),
>> this issue will also trigger. Without delay, we haven't reproduced this problem
>> so far.
>>
>>>> In most cases, if we don't take PTL, this issue won't be triggered. However,
>>>> there is still a possibility of triggering this issue. The corner case is that
>>>> task 2 triggers a page fault when task 1 is between ptep_modify_prot_start()
>>>> and ptep_modify_prot_commit() in do_numa_page(). Furthermore, task 2 passes the
>>>> check whether the PTE is not NONE before task 1 updates PTE in
>>>> ptep_modify_prot_commit() without taking PTL.
>>> There is very limited operations between ptep_modify_prot_start() and
>>> ptep_modify_prot_commit(). While the code path from page fault to this check is
>>> long. My understanding is it's very likely the PTE is not NONE when do PTE check
>>> here without hold PTL (This is my theory. :)).
>>
>> Yes, there is a high probability that this issue won't occur without taking PTL.
>>
>>> In the other side, acquiring/releasing PTL may bring performance impaction. It may
>>> not be big deal because the IO operations in this code path. But it's better to
>>> collect some performance data IMHO.
>>
>> We tested the performance of file private mapping page fault (page_fault2.c of
>> will-it-scale [1]) and file shared mapping page fault (page_fault3.c of will-it-scale).
>> The difference in performance (in operations per second) before and after patch
>> applied is about 0.7% on a x86 physical machine.
>
> Whether is it improvement or reduction?

I also think you need to test ramdisk cases, to verify whether this
causes a performance regression and, if so, by how much.

--
Best Regards,
Huang, Ying

> --
> Best Regards,
> Huang, Ying
>
>> [1] https://github.com/antonblanchard/will-it-scale/tree/master
>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>>> +
>>>>>>           /* No page in the page cache at all */
>>>>>>           count_vm_event(PGMAJFAULT);
>>>>>>           count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
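
For readers following the thread, the race being discussed can be modelled in userspace. The sketch below is only an illustration under stated assumptions, not kernel code: struct slot, updater(), fault_slow_path() and run_demo() are hypothetical names, a pthread mutex stands in for the PTL, and an atomic word stands in for the PTE (0 meaning "none"). The updater clears the entry and writes it back while holding the lock, as the thread describes for the window between ptep_modify_prot_start() and ptep_modify_prot_commit(). The faulting side sees the cleared entry without the lock, then rechecks under the lock before choosing the expensive path.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum fault_result { FAULT_NOPAGE, FAULT_MAJOR, FAULT_ERROR };

struct slot {
    pthread_mutex_t lock;        /* stands in for the page table lock */
    _Atomic unsigned long entry; /* stands in for the PTE, 0 means "none" */
    atomic_bool cleared;         /* updater has cleared the entry */
    atomic_bool observed;        /* reader has seen the cleared entry */
};

/* Read/modify/write update: clear, then write back, all under the lock. */
static void *updater(void *arg)
{
    struct slot *s = arg;

    pthread_mutex_lock(&s->lock);
    unsigned long old = atomic_exchange(&s->entry, 0);  /* "start" */
    atomic_store(&s->cleared, true);
    /* Keep the window open until the reader has looked, so the
     * interleaving is deterministic in this demo. */
    while (!atomic_load(&s->observed))
        sched_yield();
    atomic_store(&s->entry, old);                       /* "commit" */
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Recheck under the lock before committing to the expensive path. */
static enum fault_result fault_slow_path(struct slot *s)
{
    enum fault_result ret = FAULT_MAJOR;

    pthread_mutex_lock(&s->lock);     /* waits for an in-flight update */
    if (atomic_load(&s->entry) != 0)
        ret = FAULT_NOPAGE;           /* entry is back, nothing to do */
    pthread_mutex_unlock(&s->lock);
    return ret;
}

static enum fault_result run_demo(void)
{
    struct slot s;
    pthread_t t;

    pthread_mutex_init(&s.lock, NULL);
    atomic_init(&s.entry, 0x1234UL);
    atomic_init(&s.cleared, false);
    atomic_init(&s.observed, false);

    if (pthread_create(&t, NULL, updater, &s) != 0)
        return FAULT_ERROR;

    while (!atomic_load(&s.cleared))
        sched_yield();
    unsigned long seen = atomic_load(&s.entry);  /* lockless read */
    assert(seen == 0);                           /* inside the window */
    atomic_store(&s.observed, true);

    enum fault_result ret = fault_slow_path(&s);
    pthread_join(t, NULL);
    pthread_mutex_destroy(&s.lock);
    return ret;
}
```

Calling run_demo() from a main() built with -pthread should return FAULT_NOPAGE: the lockless read lands inside the window, but the locked recheck only runs after the write-back. A lockless recheck, as raised in the thread, would narrow the window but could not rule it out. This model says nothing about the cost of taking the lock, which is the open question in the thread.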