From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2c70f390-d449-fec2-350e-e73b578ad8c3@huawei.com>
Date: Tue, 6 Feb 2024 11:08:15 +0800
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0
Subject: Re: [PATCH] filemap: avoid unnecessary major faults in filemap_fault()
Content-Language: en-US
To: Yin Fengwei, David Hildenbrand, "Huang, Ying"
References: <20240204093526.212636-1-zhangpeng362@huawei.com>
 <87zfwf39ha.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <85e03dd9-8bd7-d516-ebe4-84dd449a9fb2@huawei.com>
 <87mssf2yiv.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <25de8872-ad79-e5e6-054c-9ac5e7191416@huawei.com>
 <1a967b8d-7218-4d3f-9dd2-ae1c66f626c7@redhat.com>
 <07e5ee4d-e0b7-4799-80a1-cccfbab2dd4d@intel.com>
From: "zhangpeng (AS)"
In-Reply-To: <07e5ee4d-e0b7-4799-80a1-cccfbab2dd4d@intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop:
owner-majordomo@kvack.org

On 2024/2/5 16:40, Yin Fengwei wrote:

>
> On 2/5/24 15:36, zhangpeng (AS) wrote:
>> On 2024/2/5 15:31, David Hildenbrand wrote:
>>
>>> On 05.02.24 08:24, zhangpeng (AS) wrote:
>>>> On 2024/2/5 14:52, Huang, Ying wrote:
>>>>
>>>>> "zhangpeng (AS)" writes:
>>>>>> On 2024/2/5 10:56, Huang, Ying wrote:
>>>>>>> Peng Zhang writes:
>>>>>>>> From: ZhangPeng
>>>>>>>>
>>>>>>>> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE)
>>>>>>>> in an application, which leads to an unexpected performance issue[1].
>>>>>>>>
>>>>>>>> This is caused by a temporarily cleared PTE during a read/modify/write
>>>>>>>> update of the PTE, eg, do_numa_page()/change_pte_range().
>>>>>>>>
>>>>>>>> For the data segment of the user-mode program, the global variable area
>>>>>>>> is a private mapping. After the pagecache is loaded, the private anonymous
>>>>>>>> page is generated after the COW is triggered. Mlockall can lock COW pages
>>>>>>>> (anonymous pages), but the original file pages cannot be locked and may
>>>>>>>> be reclaimed. If the global variable (private anon page) is accessed when
>>>>>>>> vmf->pte is zeroed in numa fault, a file page fault will be triggered.
>>>>>>>>
>>>>>>>> At this time, the original private file page may have been reclaimed.
>>>>>>>> If the page cache is not available at this time, a major fault will be
>>>>>>>> triggered and the file will be read, causing additional overhead.
>>>>>>>>
>>>>>>>> Fix this by rechecking the PTE without acquiring PTL in filemap_fault()
>>>>>>>> before triggering a major fault.
>>>>>>>>
>>>>>>>> Testing file anonymous page read and write page fault performance in ext4
>>>>>>>> and ramdisk using will-it-scale[2] on an x86 physical machine. The data
>>>>>>>> is the average change compared with the mainline after the patch is
>>>>>>>> applied. The test results are within the range of fluctuation, and there
>>>>>>>> is no obvious difference.
>>>>>>>> The test results are as follows:
>>>>>>>>                     processes processes_idle threads threads_idle
>>>>>>>> ext4 file write:       -1.14%         -0.08%  -1.87%        0.13%
>>>>>>>> ext4 file read:         0.03%         -0.65%  -0.51%       -0.08%
>>>>>>>> ramdisk file write:    -1.21%         -0.21%  -1.12%        0.11%
>>>>>>>> ramdisk file read:      0.00%         -0.68%  -0.33%       -0.02%
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>>>> [2] https://github.com/antonblanchard/will-it-scale/
>>>>>>>>
>>>>>>>> Suggested-by: "Huang, Ying"
>>>>>>>> Suggested-by: Yin Fengwei
>>>>>>>> Signed-off-by: ZhangPeng
>>>>>>>> Signed-off-by: Kefeng Wang
>>>>>>>> ---
>>>>>>>> RFC->v1:
>>>>>>>> - Add error handling when ptep == NULL per Huang, Ying and Matthew Wilcox
>>>>>>>> - Check the PTE without acquiring PTL in filemap_fault(), suggested by
>>>>>>>>      Huang, Ying and Yin Fengwei
>>>>>>>> - Add pmd_none() check before PTE map
>>>>>>>> - Update commit message and add performance test information
>>>>>>>>
>>>>>>>>     mm/filemap.c | 18 ++++++++++++++++++
>>>>>>>>     1 file changed, 18 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>>>>> index 142864338ca4..b29cdeb6a03b 100644
>>>>>>>> --- a/mm/filemap.c
>>>>>>>> +++ b/mm/filemap.c
>>>>>>>> @@ -3238,6 +3238,24 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>>>>                 mapping_locked = true;
>>>>>>>>             }
>>>>>>>>         } else {
>>>>>>>> +        if (!pmd_none(*vmf->pmd)) {
>>>>>>>> +            pte_t *ptep;
>>>>>>>> +
>>>>>>>> +            ptep = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>>> +                             vmf->address, &vmf->ptl);
>>>>>>>> +            if (unlikely(!ptep))
>>>>>>>> +                return VM_FAULT_NOPAGE;
>>>>>>>> +            /*
>>>>>>>> +             * Recheck pte as the pte can be cleared temporarily
>>>>>>>> +             * during a read/modify/write update.
>>>>>>>> +             */
>>>>>>> I think that we should add some comments here about the racy checking.
>>>>>> I'll add comments in a v2 as follows:
>>>>>> /*
>>>>>>    * Recheck PTE as the PTE can be cleared temporarily
>>>>>>    * during a read/modify/write update of the PTE, eg,
>>>>>>    * do_numa_page()/change_pte_range(). This will trigger
>>>>>>    * a major fault, even if we use mlockall, which may
>>>>>>    * affect performance.
>>>>>>    */
>>>>> Sorry, my previous words aren't clear enough.  I mean some comments as
>>>>> follows,
>>>>>
>>>>> We don't hold PTL here, so the check is still racy.  But acquiring PTL
>>>>> hurts performance and the race window seems small enough.
>>>> Got it. I'll add comments in a v2 as follows:
>>>> /*
>>>>    * Recheck PTE as the PTE can be cleared temporarily
>>>>    * during a read/modify/write update of the PTE.
>>>>    * We don't hold PTL here as acquiring PTL hurts
>>>>    * performance. So the check is still racy, but
>>>>    * the race window seems small enough.
>>>>    */
>>> It'd be worth spelling out what happens when we lose the race.
>>>
>> I'll add what happens when we lose the race as follows:
>> /*
>>  * Recheck PTE as the PTE can be cleared temporarily
>>  * during a read/modify/write update of the PTE, eg,
>>  * do_numa_page()/change_pte_range(). This will trigger
>>  * a major fault, even if we use mlockall, which may
>>  * affect performance.
>>  * We don't hold PTL here as acquiring PTL hurts
>>  * performance. So the check is still racy, but
>>  * the race window seems small enough.
>>  */
>>
> I believe David was asking to add:
>
> "...but the race window seems small enough.
>
> If we lose the race during the check, the page_fault will
> be triggered. But the page table entry lock still ensures
> correctness:
> - If the page cache is not reclaimed, the page_fault will
> work like the page fault was served already and bail out.
> - If the page cache is reclaimed, the major fault will be
> triggered, the page cache is filled, and the page_fault also
> works like the page fault was served already and bails out.
> "

Got it. Thanks!

-- 
Best Regards,
Peng
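
For readers following the thread, the three outcomes discussed above (recheck succeeds, race lost with the page still cached, race lost with the page reclaimed) can be modelled in userspace. The sketch below is only a toy under stated assumptions: every name in it (toy_mapping, toy_filemap_fault(), toy_restore_pte(), the race hook) is hypothetical and not a kernel API, the "page table lock" is a plain C11 spin flag, and a "major fault" is reduced to a counter increment. It is not the patch itself, just the shape of "cheap racy check first, authoritative check under the lock afterwards".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy model only; none of these names are kernel APIs. */
enum toy_fault { TOY_NOPAGE, TOY_MINOR, TOY_MAJOR };

struct toy_mapping {
	_Atomic unsigned long pte; /* 0 plays the role of pte_none() */
	atomic_flag ptl;           /* stand-in for the page table lock */
	bool page_cached;          /* is the file page still in the page cache? */
	int file_reads;            /* how many times we "read the file" */
};

/* Hypothetical hook: what a concurrent PTE updater does in the meantime. */
typedef void (*toy_race_hook)(struct toy_mapping *m);

static void toy_mapping_init(struct toy_mapping *m, unsigned long pte, bool cached)
{
	assert(m != NULL);
	atomic_init(&m->pte, pte);
	atomic_flag_clear(&m->ptl);
	m->page_cached = cached;
	m->file_reads = 0;
}

/* Models the updater finishing its read/modify/write and restoring the PTE. */
static void toy_restore_pte(struct toy_mapping *m)
{
	atomic_store(&m->pte, 0x1000UL);
}

static enum toy_fault toy_filemap_fault(struct toy_mapping *m, toy_race_hook concurrent)
{
	enum toy_fault ret = TOY_MINOR;

	/* Racy recheck, no lock taken: skip all work if the PTE is present. */
	if (atomic_load(&m->pte) != 0)
		return TOY_NOPAGE;

	/* Race lost (or a genuine fault): a reclaimed page means a major fault. */
	if (!m->page_cached) {
		m->file_reads++;
		m->page_cached = true;
		ret = TOY_MAJOR;
	}

	/* Let the simulated concurrent updater run before we take the lock. */
	if (concurrent != NULL)
		concurrent(m);

	/* Authoritative check under the lock: bail out if already served. */
	while (atomic_flag_test_and_set(&m->ptl))
		; /* spin */
	if (atomic_load(&m->pte) == 0)
		atomic_store(&m->pte, 0x2000UL);
	atomic_flag_clear(&m->ptl);

	return ret;
}
```

Calling toy_filemap_fault() on a mapping whose PTE is already populated returns TOY_NOPAGE without touching the file; passing toy_restore_pte as the hook with an empty PTE exercises the two "lost the race" cases, where the locked recheck leaves the restored PTE alone and only the reclaimed-page case counts a file read.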