Message-ID: <4da573ec-a2f9-84f4-f729-540492192957@huawei.com>
Date: Thu, 1 Feb 2024 20:10:24 +0800
Subject: Re: [RFC PATCH] mm: filemap: avoid unnecessary major faults in filemap_fault()
From: "zhangpeng (AS)" <zhangpeng362@huawei.com>
To: "Huang, Ying"
CC: Yin Fengwei
References: <20231122140052.4092083-1-zhangpeng362@huawei.com>
 <801bd0c9-7d0c-4231-93e5-7532e8231756@intel.com>
 <48235d73-3dc6-263d-7822-6d479b753d46@huawei.com>
 <87y1en7pq3.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87ttpb7p4z.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87lean7f2c.fsf@yhuang6-desk2.ccr.corp.intel.com>
 <87plzt464d.fsf@yhuang6-desk2.ccr.corp.intel.com>
In-Reply-To: <87plzt464d.fsf@yhuang6-desk2.ccr.corp.intel.com>
Content-Type: text/plain; charset="UTF-8"; format=flowed
On 2023/11/29 10:59, Huang, Ying wrote:
> "zhangpeng (AS)" writes:
>
>> On 2023/11/24 16:04, Huang, Ying wrote:
>>
>>> "zhangpeng (AS)" writes:
>>>
>>>> On 2023/11/24 12:26, Huang, Ying wrote:
>>>>
>>>>> "Huang, Ying" writes:
>>>>>
>>>>>> "zhangpeng (AS)" writes:
>>>>>>
>>>>>>> On 2023/11/23 13:26, Yin Fengwei wrote:
>>>>>>>
>>>>>>>> On 11/23/23 12:12, zhangpeng (AS) wrote:
>>>>>>>>> On 2023/11/23 9:09, Yin Fengwei wrote:
>>>>>>>>>
>>>>>>>>>> Hi Peng,
>>>>>>>>>>
>>>>>>>>>> On 11/22/23 22:00, Peng Zhang wrote:
>>>>>>>>>>> From: ZhangPeng
>>>>>>>>>>>
>>>>>>>>>>> A major fault can occur when an application uses
>>>>>>>>>>> mlockall(MCL_CURRENT | MCL_FUTURE), leading to an unexpected
>>>>>>>>>>> performance issue[1].
>>>>>>>>>>>
>>>>>>>>>>> This is caused by the pte being temporarily cleared during a
>>>>>>>>>>> read/modify/write update of the pte, e.g. in
>>>>>>>>>>> do_numa_page()/change_pte_range().
>>>>>>>>>>>
>>>>>>>>>>> For the data segment of a user-mode program, the global variable
>>>>>>>>>>> area is a private mapping. After the page cache is loaded, a
>>>>>>>>>>> private anonymous page is generated once COW is triggered.
>>>>>>>>>>> mlockall() can lock the COW pages (anonymous pages), but the
>>>>>>>>>>> original file pages cannot be locked and may be reclaimed. If the
>>>>>>>>>>> global variable (private anon page) is accessed while vmf->pte is
>>>>>>>>>>> zeroed during a NUMA fault, a file page fault will be triggered.
>>>>>>>>>>>
>>>>>>>>>>> At that point, the original file page may already have been
>>>>>>>>>>> reclaimed. If the page cache is not available, a major fault will
>>>>>>>>>>> be triggered and the file will be read, causing additional
>>>>>>>>>>> overhead.
>>>>>>>>>>>
>>>>>>>>>>> Fix this by rechecking the pte, with the ptl held, in
>>>>>>>>>>> filemap_fault() before triggering a major fault.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: ZhangPeng
>>>>>>>>>>> Signed-off-by: Kefeng Wang
>>>>>>>>>>> ---
>>>>>>>>>>>  mm/filemap.c | 14 ++++++++++++++
>>>>>>>>>>>  1 file changed, 14 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>>>>>>>>>> index 71f00539ac00..bb5e6a2790dc 100644
>>>>>>>>>>> --- a/mm/filemap.c
>>>>>>>>>>> +++ b/mm/filemap.c
>>>>>>>>>>> @@ -3226,6 +3226,20 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
>>>>>>>>>>>  			mapping_locked = true;
>>>>>>>>>>>  		}
>>>>>>>>>>>  	} else {
>>>>>>>>>>> +		pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>>>>>>>> +						  vmf->address, &vmf->ptl);
>>>>>>>>>>> +		if (ptep) {
>>>>>>>>>>> +			/*
>>>>>>>>>>> +			 * Recheck pte with ptl locked as the pte can be cleared
>>>>>>>>>>> +			 * temporarily during a read/modify/write update.
>>>>>>>>>>> +			 */
>>>>>>>>>>> +			if (unlikely(!pte_none(ptep_get(ptep))))
>>>>>>>>>>> +				ret = VM_FAULT_NOPAGE;
>>>>>>>>>>> +			pte_unmap_unlock(ptep, vmf->ptl);
>>>>>>>>>>> +			if (unlikely(ret))
>>>>>>>>>>> +				return ret;
>>>>>>>>>>> +		}
>>>>>>>>>>> +
>>>>>>>>>> I am curious. Did you try not to take the PTL here and just check
>>>>>>>>>> whether the PTE is not none?
>>>>>>>>> Thank you for your reply.
>>>>>>>>>
>>>>>>>>> If we don't take the PTL, the current use case won't trigger this
>>>>>>>>> issue either.
>>>>>>>> Is this verified by testing or just in theory?
>>>>>>> If we add a delay between ptep_modify_prot_start() and
>>>>>>> ptep_modify_prot_commit(), this issue will also trigger. Without the
>>>>>>> delay, we haven't reproduced this problem so far.
>>>>>>>
>>>>>>>>> In most cases, if we don't take the PTL, this issue won't be
>>>>>>>>> triggered. However, there is still a possibility of triggering it.
>>>>>>>>> The corner case is that task 2 triggers a page fault while task 1 is
>>>>>>>>> between ptep_modify_prot_start() and ptep_modify_prot_commit() in
>>>>>>>>> do_numa_page(). Furthermore, task 2 passes the check of whether the
>>>>>>>>> PTE is not none before task 1 updates the PTE in
>>>>>>>>> ptep_modify_prot_commit(), without taking the PTL.
>>>>>>>> There are very limited operations between ptep_modify_prot_start()
>>>>>>>> and ptep_modify_prot_commit(), while the code path from page fault to
>>>>>>>> this check is long. My understanding is that it's very likely the PTE
>>>>>>>> is not none when the PTE is checked here without holding the PTL
>>>>>>>> (this is my theory :)).
>>>>>>> Yes, there is a high probability that this issue won't occur without
>>>>>>> taking the PTL.
>>>>>>>
>>>>>>>> On the other side, acquiring/releasing the PTL may have a performance
>>>>>>>> impact. It may not be a big deal because of the IO operations in this
>>>>>>>> code path, but it's better to collect some performance data IMHO.
>>>>>>> We tested the performance of file private mapping page faults
>>>>>>> (page_fault2.c of will-it-scale [1]) and file shared mapping page
>>>>>>> faults (page_fault3.c of will-it-scale). The difference in performance
>>>>>>> (in operations per second) before and after the patch is applied is
>>>>>>> about 0.7% on an x86 physical machine.
>>>>>> Is that an improvement or a reduction?
>>>>> And I think that you need to test ramdisk cases too, to verify whether
>>>>> this causes a performance regression and how much.
>>>> Yes, I will.
>>>> In addition, are there any recommended ramdisk test cases? 😁
>>> I think that you can start with the will-it-scale test case you used
>>> before. And you can try some workload with a large number of major
>>> faults, like file read with mmap.
>> I used will-it-scale to test the page faults of ext4 files and tmpfs
>> files. The data is the average change compared with the mainline after
>> the patch is applied. The test results are within the range of
>> fluctuation, and there is no obvious difference. The test results are
>> as follows:
>>
>>                            processes  processes_idle  threads  threads_idle
>> ext4 private file write:    -0.51%       0.08%        -0.03%     -0.04%
>> ext4 shared file write:      0.135%     -0.531%        2.883%    -0.772%
>> tmpfs private file write:   -0.344%     -0.110%        0.200%     0.145%
>> tmpfs shared file write:     0.958%      0.101%        2.781%    -0.337%
>> tmpfs private file read:    -0.16%       0.00%        -0.12%      0.41%
> Thank you very much for the test results!
>
> We shouldn't use tmpfs, because there will be no major faults there.
> Please check your major fault numbers to verify that. IIUC, ram disk +
> disk file system should be used.
>
> And, please make sure that there's no heavy lock contention in the base
> kernel, because if some heavy lock contention kills performance, there
> will be no performance difference between the base and patched kernels.

I'm sorry I was so late to finish the test and reply.

I used will-it-scale to test the page faults of ramdisk files. The data
is the average change compared with the mainline after the patch is
applied. The test results are as follows:

                             processes  processes_idle  threads  threads_idle
ramdisk private file write:   -0.48%        0.23%       -1.08%      0.27%
ramdisk private file read:     0.07%       -6.90%       -5.85%     -0.70%

Applied patch:

diff --git a/mm/filemap.c b/mm/filemap.c
index 32eedf3afd45..2db9ccfbd5e3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3226,6 +3226,22 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 			mapping_locked = true;
 		}
 	} else {
+		if (!pmd_none(*vmf->pmd)) {
+			pte_t *ptep = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
+							  vmf->address, &vmf->ptl);
+			if (unlikely(!ptep))
+				return VM_FAULT_NOPAGE;
+			/*
+			 * Recheck pte with ptl locked as the pte can be cleared
+			 * temporarily during a read/modify/write update.
+			 */
+			if (unlikely(!pte_none(ptep_get(ptep))))
+				ret = VM_FAULT_NOPAGE;
+			pte_unmap_unlock(ptep, vmf->ptl);
+			if (unlikely(ret))
+				return ret;
+		}
+
 		/* No page in the page cache at all */
 		count_vm_event(PGMAJFAULT);
 		count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);

> --
> Best Regards,
> Huang, Ying

-- 
Best Regards,
Peng