From: Kefeng Wang <wangkefeng.wang@huawei.com>
Date: Tue, 25 Jun 2024 20:23:20 +0800
Subject: Re: [PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings
To: Baolin Wang, Ryan Roberts, Catalin Marinas, Will Deacon, Ard Biesheuvel,
 Marc Zyngier, James Morse, Andrey Ryabinin, Andrew Morton, Matthew Wilcox,
 Mark Rutland, David Hildenbrand, John Hubbard, Zi Yan,
 Barry Song <21cnbao@gmail.com>, Alistair Popple, Yang Shi, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, "H. Peter Anvin", "Yin, Fengwei"
Cc: linux-mm@kvack.org
Message-ID: <99aa61b6-afc9-445f-8f50-1e017450efd1@huawei.com>
References: <20240215103205.2607016-1-ryan.roberts@arm.com>
 <20240215103205.2607016-19-ryan.roberts@arm.com>
 <1285eb59-fcc3-4db8-9dd9-e7c4d82b1be0@huawei.com>
 <8d57ed0d-fdd0-4fc6-b9f1-a6ac11ce93ce@arm.com>
 <018b5e83-789e-480f-82c8-a64515cdd14a@huawei.com>
On 2024/6/25 15:23, Baolin Wang wrote:
>
> On 2024/6/25 11:16, Kefeng Wang wrote:
>>
>> On 2024/6/24 23:56, Ryan Roberts wrote:
>>> + Baolin Wang and Yin Fengwei, who may be able to help with this.
>>>
>>> Hi Kefeng,
>>>
>>> Thanks for the report!
>>>
>>> On 24/06/2024 15:30, Kefeng Wang wrote:
>>>> Hi Ryan,
>>>>
>>>> We see a big regression on the page_fault3 ("Separate file shared
>>>> mapping page fault") testcase from will-it-scale on arm64; there is
>>>> no issue on x86:
>>>>
>>>>   ./page_fault3_processes -t 128 -s 5
>>>
>>> I see that this program is mkstemp'ing a file at
>>> "/tmp/willitscale.XXXXXX". Based on your description, I'm inferring
>>> that /tmp is backed by ext4 with your large folio patches enabled?
>>
>> Yes, /tmp is mounted as ext4 - sorry, I forgot to mention that.
>>
>>>
>>>> 1) large folio disabled on ext4:
>>>>        92378735
>>>> 2) large folio enabled on ext4 + CONTPTE enabled:
>>>>        16164943
>>>> 3) large folio enabled on ext4 + CONTPTE disabled:
>>>>        80364074
>>>> 4) large folio enabled on ext4 + CONTPTE enabled + large folio
>>>>    mapping enabled in finish_fault() [2]:
>>>>       299656874
>>>>
>>>> We found that *contpte_convert* consumes a lot of CPU (76%) in case 2),
>>>
>>> contpte_convert() is expensive and to be avoided. In this case I
>>> expect it is repainting the PTEs with the PTE_CONT bit added in, and
>>> to do that it needs to invalidate the TLB for the virtual range. The
>>> code is there to mop up user space patterns where each page in a
>>> range is temporarily made RO, then later changed back; in this case,
>>> we want to re-fold the contpte range once all pages have been
>>> serviced in RO mode.
>>>
>>> Of course this path is only intended as a fallback, and the more
>>> optimal approach is to set_ptes() the whole folio in one go where
>>> possible - kind of what you are doing below.
>>>
>>>> and it disappeared with the following change [2]. The difference
>>>> between case 2) and case 4) is easy to understand, since case 2)
>>>> always maps one page at a time but always tries to fold contpte
>>>> mappings, which spends a lot of time. Case 4) is a workaround - any
>>>> better suggestion?
>>>
>>> See below.
>>>
>>>> Thanks.
>>>>
>>>> [1] https://github.com/antonblanchard/will-it-scale
>>>> [2] enable large folio mapping in finish_fault()
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 00728ea95583..5623a8ce3a1e 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -4880,7 +4880,7 @@ vm_fault_t finish_fault(struct vm_fault *vmf)
>>>>           * approach also applies to non-anonymous-shmem faults to avoid
>>>>           * inflating the RSS of the process.
>>>>           */
>>>> -       if (!vma_is_anon_shmem(vma) || unlikely(userfaultfd_armed(vma))) {
>>>> +       if (unlikely(userfaultfd_armed(vma))) {
>>>
>>> The change to make finish_fault() handle multiple pages in one go is
>>> new; it was added by Baolin Wang at [1]. The extra conditional that
>>> you have removed is there to prevent RSS reporting bloat - see the
>>> discussion that starts at [2].
>>>
>>> Anyway, it was my vague understanding that the fault-around mechanism
>>> (do_fault_around()) would ensure that (by default) 64K worth of pages
>>> get mapped together in a single set_ptes() call, via
>>> filemap_map_pages() -> filemap_map_folio_range(). Looking at the
>>> code, I guess fault-around only applies to read faults, and this test
>>> is doing write faults.
>>>
>>> I guess we need a change a bit like what you have done, but one that
>>> also takes the fault_around configuration into account?
>
> For the writable mmap() of tmpfs, we will use the mTHP interface to
> control the size of the folio to allocate, as discussed in a previous
> meeting [1], so I don't think the fault_around configuration will be
> helpful for tmpfs.

Yes, tmpfs is different from ext4.

>
> For other filesystems, like ext4, I did not find the logic that
> determines what size of folio to allocate in the writable mmap() path
> (Kefeng, please correct me if I missed something). If there is a
> control like mTHP, we can rely on that instead of 'fault_around'?

For ext4 and most filesystems, the folio is allocated from
filemap_fault(); we don't have an explicit interface like mTHP to
control the folio size.

>
> [1] https://lore.kernel.org/all/f1783ff0-65bd-4b2b-8952-52b6822a0835@redhat.com/
>
>> Yes, the current changes are not enough - I hit some issues and am
>> still debugging. So our direction is to try to map large folios in
>> do_shared_fault(), right?
>
> I think this is the right direction to take.
> I added this '!vma_is_anon_shmem(vma)' condition to gradually
> implement support for building large folio mappings, especially for
> writable mmap() support in tmpfs.
>
>>> [1] https://lore.kernel.org/all/3a190892355989d42f59cf9f2f98b94694b0d24d.1718090413.git.baolin.wang@linux.alibaba.com/
>>> [2] https://lore.kernel.org/linux-mm/13939ade-a99a-4075-8a26-9be7576b7e03@arm.com/
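(To make the folding condition Ryan describes concrete: here is a simplified userspace model, not the arch/arm64/mm/contpte.c code - the pte layout, attribute field, and function name are made up for illustration. A 16-entry block can gain PTE_CONT only when every pte is present, the frames are physically contiguous and naturally aligned, and the attributes match. Mapping a 64K folio one pte at a time repeats a scan like this, plus a TLB invalidation when it succeeds, while a single set_ptes() of the whole folio sets the bit once and skips all of that.)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CONT_PTES 16 /* 16 x 4K pages = one 64K contpte block */

/* Toy pte: frame number, permission/attribute bits, valid flag. */
struct pte {
	uint64_t pfn;
	uint32_t attrs;
	bool present;
};

/*
 * Return true if the 16-entry block could be folded into a single
 * contiguous mapping: naturally aligned first frame, all entries
 * present, frames contiguous, attributes uniform.
 */
static bool contpte_is_foldable(const struct pte pte_block[CONT_PTES])
{
	if (!pte_block[0].present || pte_block[0].pfn % CONT_PTES)
		return false;
	for (size_t i = 1; i < CONT_PTES; i++) {
		if (!pte_block[i].present ||
		    pte_block[i].pfn != pte_block[0].pfn + i ||
		    pte_block[i].attrs != pte_block[0].attrs)
			return false;
	}
	return true;
}
```

In the per-page-fault case of the benchmark above, a check of this shape runs on every fault in the block; only the final fault in each 64K range can succeed, so the first fifteen scans per block are pure overhead.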