From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 38A62C48BF6 for ; Thu, 29 Feb 2024 12:21:52 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 9E60D6B00DE; Thu, 29 Feb 2024 07:21:51 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 995926B00DF; Thu, 29 Feb 2024 07:21:51 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 85DF56B00E0; Thu, 29 Feb 2024 07:21:51 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0016.hostedemail.com [216.40.44.16]) by kanga.kvack.org (Postfix) with ESMTP id 75C0D6B00DE for ; Thu, 29 Feb 2024 07:21:51 -0500 (EST) Received: from smtpin19.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay05.hostedemail.com (Postfix) with ESMTP id 177D940B2E for ; Thu, 29 Feb 2024 12:21:51 +0000 (UTC) X-FDA: 81844752822.19.3EDEA66 Received: from szxga06-in.huawei.com (szxga06-in.huawei.com [45.249.212.32]) by imf27.hostedemail.com (Postfix) with ESMTP id 439D740006 for ; Thu, 29 Feb 2024 12:21:46 +0000 (UTC) Authentication-Results: imf27.hostedemail.com; dkim=none; spf=pass (imf27.hostedemail.com: domain of zhangpeng362@huawei.com designates 45.249.212.32 as permitted sender) smtp.mailfrom=zhangpeng362@huawei.com; dmarc=pass (policy=quarantine) header.from=huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1709209308; a=rsa-sha256; cv=none; b=4euXGPqEQIpStCH+1xuIRzWUlKX64RjwGQv4+oc4bYynQi5OQLb5JYswQuTK/xckBt1cSG GcDF10R2iLkn/1icVDLLHKnVeTAE3VANwCWp3Yj/33nleOXsfwJpZxw0xQBAjNPxDM3rbP ek8onzssZRVoFs84gAbOgIpXqPlZlBw= ARC-Authentication-Results: i=1; imf27.hostedemail.com; dkim=none; spf=pass (imf27.hostedemail.com: domain of zhangpeng362@huawei.com designates 45.249.212.32 as permitted sender) smtp.mailfrom=zhangpeng362@huawei.com; dmarc=pass (policy=quarantine) header.from=huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1709209308; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=X/czXgVSQDklNayIpI6SowF+Ci7Gi9OqpQKfYNiL4jY=; b=u8mrbrXmy0ToH8jbCWXXSfkgjqNdTgl9DwYEV1PNfhaBGtx+q+61LeTFAo/rLQWLpWwL55 t8DAEBkFV/WfF8o8CPFxsCrIXN3EcFsFHC8u+8589akxFI5/hH4IP8jBQy7TQ50HKm7Yus W2XBJPhGpUGPBRhbpTdDN3K5JwiDOTU= Received: from mail.maildlp.com (unknown [172.19.88.163]) by szxga06-in.huawei.com (SkyGuard) with ESMTP id 4Tlr0w1h5zzqjgr; Thu, 29 Feb 2024 20:21:04 +0800 (CST) Received: from kwepemm600020.china.huawei.com (unknown [7.193.23.147]) by mail.maildlp.com (Postfix) with ESMTPS id D859118002D; Thu, 29 Feb 2024 20:21:41 +0800 (CST) Received: from [10.174.179.160] (10.174.179.160) by kwepemm600020.china.huawei.com (7.193.23.147) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.1.2507.35; Thu, 29 Feb 2024 20:21:40 +0800 Message-ID: <958b204f-63ec-55b3-b6e2-dc72a3ce704d@huawei.com> Date: Thu, 29 Feb 2024 20:21:39 +0800 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0 Subject: Re: [PATCH v3] filemap: avoid unnecessary major faults in filemap_fault() Content-Language: en-US To: David Hildenbrand , , CC: , , , , , , , , References: <20240229060907.836589-1-zhangpeng362@huawei.com> <038e55eb-70a9-445d-89ef-4b989eaa9c66@redhat.com> From: "zhangpeng (AS)" In-Reply-To: <038e55eb-70a9-445d-89ef-4b989eaa9c66@redhat.com> Content-Type: text/plain; charset="UTF-8"; format=flowed Content-Transfer-Encoding: 8bit X-Originating-IP: [10.174.179.160] X-ClientProxiedBy: dggems704-chm.china.huawei.com (10.3.19.181) To kwepemm600020.china.huawei.com (7.193.23.147) X-Rspam-User: X-Rspamd-Server: rspam06 X-Rspamd-Queue-Id: 439D740006 X-Stat-Signature: pw91e1ccf8rwdbub67bkftmsecykwbg9 X-HE-Tag: 1709209306-564290 X-HE-Meta: U2FsdGVkX18gyF2WJ9ZyOGLEccqSqTSIv64b3idviINPNLe/whRtReG4EOKxMhii5U17Tv8mjRLK1mMl3E+0aEGy/citrhpQOUtLV1VnVQGGh8MHW72R4IVu26BX/HE/rCyw+Nog3ZoMrEJwi1+2s210Tcuiamxl4nBnQDx40Cns75GXVqHhAaZ654sAi/UU2+EjSsMS3g1IY/Jwr7IiWpVfMYIYv76oKHai+yRuP89qJ3jlE9pgz8a8DeL7MCIdeXBprsAEjTHIsnbtVFzTkdA/9clt3fNqgfAnkeeHz/Gfu+B8Je+GgI7UXz5iza7gDjfkSSnpffF2XmsB4WiklTzqQ8+rTfJ57Q2ck9WV+jHGQIouQFLATKC8Jm2CdiY2omlEacWWwBNkTqUg4AUr9OZnm+6xJZb6cbB2B/i8aWAMelAOlWL5TZV1/EKWNAGRZtL2gdzk0sVYYzQ3miQhOkkQyTXXYAbEoB6M6anXtkgEQb2Pbc5/olLw5MWTLT7B5gqU2GGVKgzG+eenpKUbirZfQzRxaf2+Klcch+LYjlqtnCCDU9IimgOP+xrTnodLYzVmG5S++CHR+H6vefsbBy1W8Q/0D05WJqJMnbx1inciO7oGBmnlFL1yNTTTUtzZreOFq3wjpODcHn3o9l/GHMrVhmi4Nb2tHAjybvl2OEh02dKADBqENnWv/yo2/BfzfQNZT6OvuiDUtaFYNOP1lLKrd8bTVqrQDIcQ1+TcIr5qnl+3KhF+WlW+p91K5Dy754f1/WV8GQXHeQr6fmZv+qjrmbOMN1XUD1G+iN7HtmhsshGS8VfxNe2sYR5IlUCF5lpZq0lE7kFJ0fUUFQdKCWbJ3AUAaPDghQ/0KRIjscnTqXtr7apoBmr0cIwEp4EACfYPkp0kviT19c+1dgZASsFZgZr/Mxc9MbDc/CNLX2vVYKV1zb7ouc6HaEHJPlXYlpm57x01GOXqqWvEgIR OPDr0MW9 pWRbrPJ8W9wSvYS6w7R2VuHhhGh3R9mWKNecggKMF31K9ih0DnjBxQ9AGeZUIAbwftV11VoVDafAKfx6Cgjii087cKI/jLrAzvmIr00KQ3wVYyqz7V08Ji/S70Xgh+MJoQiTuHsJN8ZTTOG54MPlqTaDR0QVWJkawQnjxRX96J4o9NcOa9cbkjw9UruQZWs96no34yccNtovd+Xp2UB01yNJQQXNkvTDpYvyb8HSDUhttgXRSkJL9iI/J6DURWwiSTyclagMjJ6gDhDsx2XbDgjyTPwz04SMpSvs4J+MigraEBfb7w9uIuogX0Io7/zblIkL7Mc3n6EPfaS3ECuGPM80KM/L/+WbJqJyzh6ldfNL+tQM= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/2/29 16:56, David Hildenbrand wrote: > On 29.02.24 07:09, Peng Zhang wrote: >> From: ZhangPeng >> >> The major fault occurred when using mlockall(MCL_CURRENT | MCL_FUTURE) >> in application, which leading to an unexpected issue[1]. >> >> This caused by temporarily cleared PTE during a read+clear/modify/write >> update of the PTE, eg, do_numa_page()/change_pte_range(). >> >> For the data segment of the user-mode program, the global variable area >> is a private mapping. After the pagecache is loaded, the private >> anonymous >> page is generated after the COW is triggered. Mlockall can lock COW >> pages >> (anonymous pages), but the original file pages cannot be locked and may >> be reclaimed. If the global variable (private anon page) is accessed >> when >> vmf->pte is zeroed in numa fault, a file page fault will be triggered. >> At this time, the original private file page may have been reclaimed. >> If the page cache is not available at this time, a major fault will be >> triggered and the file will be read, causing additional overhead. >> >> This issue affects our traffic analysis service. The inbound traffic is >> heavy. If a major fault occurs, the I/O schedule is triggered and the >> original I/O is suspended. Generally, the I/O schedule is 0.7 ms. If >> other applications are operating disks, the system needs to wait for >> more than 10 ms. However, the inbound traffic is heavy and the NIC >> buffer >> is small. As a result, packet loss occurs. But the traffic analysis >> service >> can't tolerate packet loss. >> >> Fix this by holding PTL and rechecking the PTE in filemap_fault() before >> triggering a major fault. We do this check only if vma is VM_LOCKED. In >> our service test environment, the baseline is 7 major faults / 12 hours. >> After the patch is applied, no major fault will be triggered. >> >> Testing file anonymous page read and write page fault performance in >> ext4, tmpfs and ramdisk using will-it-scale[2] on a x86 physical >> machine. >> The data is the average change compared with the mainline after the >> patch >> is applied. The test results are indicates some performance regressions. >> We do this check only if vma is VM_LOCKED, therefore, no performance >> regressions is caused for most common cases. >> >> The test results are as follows: >>                            processes processes_idle threads threads_idle >> ext4    private file write: -0.51%    0.08%          -0.03% -0.04% >> ext4    shared  file write:  0.135%  -0.531%          2.883% -0.772% >> ramdisk private file write: -0.48%    0.23%          -1.08% 0.27% >> ramdisk private file  read:  0.07%   -6.90%          -5.85% -0.70% >> tmpfs   private file write: -0.344%  -0.110%          0.200% 0.145% >> tmpfs   shared  file write:  0.958%   0.101%          2.781% -0.337% >> tmpfs   private file  read: -0.16%    0.00%          -0.12% 0.41% >> >> [1] >> https://lore.kernel.org/linux-mm/9e62fd9a-bee0-52bf-50a7-498fa17434ee@huawei.com/ >> [2] https://github.com/antonblanchard/will-it-scale/ >> >> Suggested-by: "Huang, Ying" >> Suggested-by: David Hildenbrand >> Signed-off-by: ZhangPeng >> Signed-off-by: Kefeng Wang >> --- >> v2->v3: >> - Do this check only if vma is VM_LOCKED per David Hildenbrand >> - Hold PTL and recheck the PTE >> - Place the recheck code in a new function filemap_fault_recheck_pte() >> >> v1->v2: >> - Add more test results per Huang, Ying >> - Add more comments before check PTE per Huang, Ying, David Hildenbrand >>    and Yin Fengwei >> - Change pte_offset_map_nolock to pte_offset_map as the PTL won't >>    be used >> >> RFC->v1: >> - Add error handling when ptep == NULL per Huang, Ying and Matthew >>    Wilcox >> - Check the PTE without acquiring PTL in filemap_fault(), suggested by >>    Huang, Ying and Yin Fengwei >> - Add pmd_none() check before PTE map >> - Update commit message and add performance test information >> >>   mm/filemap.c | 40 ++++++++++++++++++++++++++++++++++++++++ >>   1 file changed, 40 insertions(+) >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index b4858d89f1b1..2668bac68df7 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -3181,6 +3181,42 @@ static struct file >> *do_async_mmap_readahead(struct vm_fault *vmf, >>       return fpin; >>   } >>   +/* >> + * filemap_fault_recheck_pte - hold PTL and recheck whether pte is >> none. >> + * @vmf - the vm_fault for this fault. >> + * >> + * Recheck PTE as the PTE can be cleared temporarily during a >> read+clear/modify >> + * /write update of the PTE, eg, do_numa_page()/change_pte_range(). >> This will >> + * trigger an unexpected major fault, even if we use mlockall(), >> which may >> + * increase IO and thus cause other unexpected behavior. >> + * >> + * Return VM_FAULT_NOPAGE if the PTE is not none or >> pte_offset_map_lock() >> + * fails. In other cases, 0 is returned. >> + */ > > That documentation is imprecise, as you are not explaining the mlock > limitation. > > It's an internal helper, I'd drop all that and rather add a comment > below right next to the conditions that are performing the check ... > That makes sense. >> +static vm_fault_t filemap_fault_recheck_pte(struct vm_fault *vmf) >> +{ >> +    struct vm_area_struct *vma = vmf->vma; >> +    vm_fault_t ret = 0; >> +    pte_t *ptep; >> + >> +    if (!(vma->vm_flags & VM_LOCKED)) >> +        return ret; > > I was wondering if we also want to do: > > if (!is_cow_mappinng(vma->vm_flags)) >     return 0; > > But likely it's not helpful. > Maybe it's enough to check if the VMA is VM_LOCKED? No performance degradation for most common scenarios. > > Then add something like: > > /* >  * We might have COW'ed a pageache folio and might now have an mlocked Nit: s/pageache/pagecache >  * anon folio mapped. The original pagecache folio is not mlocked and >  * might have been evicted. During a read+clear/modify/write update of >  * the PTE, such as done in do_numa_page()/change_pte_range(), we >  * temporarily clear the PTE under PT lock and might detect it here as >  * "none" when not holding the PT lock. >  * >  * Not rechecking the PTE under PT lock could result in an >  * unexpected major fault in an mlock'ed region. Recheck only for >  * this special scenario while holding the PT lock, to not degrade >  * non-mlocked scenarios. >  */ > Thanks! I'll add these comments in the next version. >> + >> +    if (pmd_none(*vmf->pmd)) >> +        return ret; > > I'd  simply return 0 in both cases, easier to read. > Agreed. >> + >> +    ptep = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, >> +                   &vmf->ptl); >> +    if (unlikely(!ptep)) >> +        return VM_FAULT_NOPAGE; >> + >> +    if (unlikely(!pte_none(ptep_get(ptep)))) >> +        ret = VM_FAULT_NOPAGE; >> + >> +    pte_unmap_unlock(ptep, vmf->ptl); >> +    return ret; >> +} > -- Best Regards, Peng