Subject: Re: [PATCH] arm64: mm: fix linear mapping mem access performance degradation
From: "guanghui.fgh" <guanghuifeng@linux.alibaba.com>
To: "Leizhen (ThunderTown)", Mike Rapoport
Cc: baolin.wang@linux.alibaba.com, catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org, david@redhat.com, jianyong.wu@arm.com, james.morse@arm.com, quic_qiancai@quicinc.com, christophe.leroy@csgroup.eu, jonathan@marek.ca, mark.rutland@arm.com, anshuman.khandual@arm.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, geert+renesas@glider.be, ardb@kernel.org, linux-mm@kvack.org, bhe@redhat.com, Yao Hongbo
Date: Tue, 28 Jun 2022 15:52:48 +0800
Message-ID: <2e76694e-9ead-fc05-c8ad-01646ff02151@linux.alibaba.com>
In-Reply-To: <54f13945-fa35-247d-ca33-182931fd05ff@linux.alibaba.com>
References: <1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com> <4d18d303-aeed-0beb-a8a4-32893f2d438d@linux.alibaba.com> <32aefb80-c59c-74b6-c373-dd24edba0752@huawei.com> <075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com> <55873a70-da46-b0f6-db81-841a2b5e886a@huawei.com> <54f13945-fa35-247d-ca33-182931fd05ff@linux.alibaba.com>
On 2022/6/28 11:06, guanghui.fgh wrote:
> Thanks.
>
> On 2022/6/28 9:34, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/6/27 20:25, guanghui.fgh wrote:
>>> Thanks.
>>>
>>> On 2022/6/27 20:06, Leizhen (ThunderTown) wrote:
>>>>
>>>>
>>>> On 2022/6/27 18:46, guanghui.fgh wrote:
>>>>>
>>>>>
>>>>> On 2022/6/27 17:49, Mike Rapoport wrote:
>>>>>> Please don't post HTML.
>>>>>>
>>>>>> On Mon, Jun 27, 2022 at 05:24:10PM +0800, guanghui.fgh wrote:
>>>>>>> Thanks.
>>>>>>>
>>>>>>> On 2022/6/27 14:34, Mike Rapoport wrote:
>>>>>>>
>>>>>>>     On Sun, Jun 26, 2022 at 07:10:15PM +0800, Guanghui Feng wrote:
>>>>>>>
>>>>>>>         arm64 can build 2M/1G block/section mappings. When using a
>>>>>>>         DMA/DMA32 zone (crashkernel enabled, rodata full disabled,
>>>>>>>         kfence disabled), the mem_map will use non-block/section
>>>>>>>         mappings, because reserving the crashkernel requires
>>>>>>>         shrinking the region at page granularity. But this degrades
>>>>>>>         performance for large contiguous memory accesses in the
>>>>>>>         kernel (memcpy/memmove, etc.).
>>>>>>>
>>>>>>>         There are many related changes and discussions:
>>>>>>>         commit 031495635b46
>>>>>>>         commit 1a8e1cef7603
>>>>>>>         commit 8424ecdde7df
>>>>>>>         commit 0a30c53573b0
>>>>>>>         commit 2687275a5843
>>>>>>>
>>>>>>>     Please include a one-line summary of each commit. (See section
>>>>>>>     "Describe your changes" in
>>>>>>>     Documentation/process/submitting-patches.rst)
>>>>>>>
>>>>>>> OK, I will add one-line summaries to the git commit message.
>>>>>>>
>>>>>>>         This patch changes the mem_map to use block/section mappings
>>>>>>>         with a crashkernel. First, build block/section mappings
>>>>>>>         (normally 2M or 1G) for all available memory in the mem_map,
>>>>>>>         and reserve the crashkernel memory. Then walk the pagetable
>>>>>>>         to split the block/section mappings into non-block/section
>>>>>>>         mappings (normally 4K) [[[only]]] for the crashkernel memory.
>>>>>>>
>>>>>>>     This already happens when ZONE_DMA/ZONE_DMA32 are disabled.
>>>>>>>     Please explain why it is OK to change the way the memory is
>>>>>>>     mapped with ZONE_DMA/ZONE_DMA32 enabled.
>>>>>>>
>>>>>>> In short:
>>>>>>>
>>>>>>> 1. Build block/section mappings (normally 1G/2M) for all available
>>>>>>> memory, without inspecting the crashkernel region.
>>>>>>> 2. Reserve the crashkernel memory the same way as before.
>>>>>>> 3. Only change the crashkernel memory mapping to normal mappings
>>>>>>> (normally 4K).
>>>>>>> With this method, block/section mappings are used as much as
>>>>>>> possible.
>>>>>>
>>>>>> This does not answer the question of why changing the way the memory
>>>>>> is mapped when there is ZONE_DMA/DMA32 and a crashkernel won't cause
>>>>>> a regression.
>>>>>>
>>>>> 1. Quoting the comment in arch/arm64/mm/init.c:
>>>>>
>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>
>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>> here instead of max_zone_phys().  This lets early reservation of
>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>> mappings (greater than page-granularity) for all the memory bank ranges.
>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>
>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>>> memory range to avoid overlap allocation.  So crash kernel memory
>>>>> boundaries are not known when mapping all bank memory ranges, which
>>>>> otherwise means not possible to exclude crash kernel range from
>>>>> creating block mappings so page-granularity mappings are created
>>>>> for the entire memory range."
>>>>>
>>>>> Namely, the init order is: memblock init ---> linear memory mapping
>>>>> (4K mappings for the crashkernel, which requires page-granularity
>>>>> changes) ---> zone DMA limit ---> reserve crashkernel.
>>>>> So when ZONE_DMA is enabled and a crashkernel is used, the linear
>>>>> mapping uses 4K mappings.
>>>>>
>>>>> 2. As mentioned above, when the linear mapping simply uses 4K
>>>>> mappings, there are many dTLB misses (degrading performance).
>>>>> This patch uses block/section mappings as far as possible, which
>>>>> improves performance.
>>>>>
>>>>> 3. This patch reserves the crashkernel the same way as before (the
>>>>> ZONE_DMA and crashkernel reservation order is unchanged), and only
>>>>> changes the linear memory mapping to block/section mappings.
>>>>> .
>>>>>
>>>> I think Mike Rapoport is probably asking you to answer whether you've
>>>> taken things such as BBM (break-before-make) into account. For
>>>> example, in the following code: we should prepare the next-level
>>>> pgtable first, then change the 2M block mapping to 4K page mappings,
>>>> and flush the TLB at the end.
>>>> +static void init_crashkernel_pmd(pud_t *pudp, unsigned long addr,
>>>> +                 unsigned long end, phys_addr_t phys,
>>>> +                 pgprot_t prot,
>>>> +                 phys_addr_t (*pgtable_alloc)(int), int flags)
>>>> +{
>>>> +    phys_addr_t map_offset;
>>>> +    unsigned long next;
>>>> +    pmd_t *pmdp;
>>>> +    pmdval_t pmdval;
>>>> +
>>>> +    pmdp = pmd_offset(pudp, addr);
>>>> +    do {
>>>> +        next = pmd_addr_end(addr, end);
>>>> +        if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
>>>> +            phys_addr_t pte_phys = pgtable_alloc(PAGE_SHIFT);
>>>> +            pmd_clear(pmdp);
>>>> +            pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
>>>> +            if (flags & NO_EXEC_MAPPINGS)
>>>> +                pmdval |= PMD_TABLE_PXN;
>>>> +            __pmd_populate(pmdp, pte_phys, pmdval);
>>>> +            flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>>
>>>> The pgtable is empty now. However, memory other than the crashkernel
>>>> may be being accessed.
>>> 1. When reserving the crashkernel and remapping the linear mapping,
>>> only the boot CPU is running. No other CPU/thread is running at the
>>> same time.
>>
>> So, put this in a code comment?
> OK.
>>
>> If scalability is considered, unpredictable changes may occur in the
>> future; for example, other modules may also need this mapping function.
>> It would be better to deal with BBM now, and make this function public.
> OK, could you give me some advice?
>>
>>
>>>
>>> 2. When clearing the block/section mapping, I flush the TLB with
>>> flush_tlb_kernel_range(). Afterwards I rebuild the 4K mapping (I think
>>> there is no need to flush the TLB again).
>>
>>
>>>
>>>>
>>>> +
>>>> +            map_offset = addr - (addr & PMD_MASK);
>>>> +            if (map_offset)
>>>> +                alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
>>>> +                        phys - map_offset, prot,
>>>> +                        pgtable_alloc, flags);
>>>> +
>>>> +            if (next < (addr & PMD_MASK) + PMD_SIZE)
>>>> +                alloc_init_cont_pte(pmdp, next, (addr & PUD_MASK) +
>>>> +                        PUD_SIZE, next - addr + phys,
>>>> +                        prot, pgtable_alloc, flags);
>>
>> Here and in alloc_crashkernel_pud(), the raw flags should be used.
>> They may not contain (NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS).
> Yes, the memory outside the crashkernel should use block/section
> mappings as far as possible, including the left and right margins.
> But I have tested this on a HiSilicon Kunpeng 920-6426 and got a
> performance degradation (without the NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS
> flags for the left/right margins).
> It's strange; could you give some advice? Maybe it works well on arm
> platforms other than the HiSilicon Kunpeng 920-6426.

We should split the non-crashkernel memory [[[ without ]]] the
NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS flags.

I have tested it on other arm platforms [[[ non-HiSilicon arm
platforms ]]] and also got a great performance improvement.

Could you help me check the difference between the HiSilicon Kunpeng
920-6426 and other arm platforms regarding block/section mapping TLB
support?
>>
>>>> +        }
>>>> +        alloc_crashkernel_cont_pte(pmdp, addr, next, phys, prot,
>>>> +                       pgtable_alloc, flags);
>>>> +        phys += next - addr;
>>>> +    } while (pmdp++, addr = next, addr != end);
>>>> +}
>>>>
>>> .
>>>
>>
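
P.S. To make the BBM question above concrete, here is a minimal sketch
of the prepare/break/make ordering Leizhen describes. It is illustrative
only, not code from the patch under review: split_pmd_block() and its
prot parameter are hypothetical names, the table descriptor is
simplified (no PXN/UXN handling), and it assumes nothing accesses the
2M range while it is briefly unmapped (e.g. a single boot CPU, as
discussed above).

#include <linux/mm.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/*
 * Split one 2M block entry into a 4K PTE table, break-before-make:
 * fully populate the new table first, then clear the block entry and
 * flush the TLB, and only then install the table entry, so a page
 * table walker never sees a half-filled table.
 */
static void split_pmd_block(pmd_t *pmdp, unsigned long addr, pgprot_t prot,
			    phys_addr_t (*pgtable_alloc)(int))
{
	phys_addr_t pte_phys = pgtable_alloc(PAGE_SHIFT);
	pte_t *ptep = (pte_t *)__va(pte_phys);
	phys_addr_t phys = __pmd_to_phys(*pmdp);	/* block base */
	unsigned long base = addr & PMD_MASK;
	int i;

	/* 1. Prepare: fill the new PTE table before it becomes visible. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		set_pte(ptep + i,
			pfn_pte(__phys_to_pfn(phys + i * PAGE_SIZE), prot));

	/*
	 * 2. Break: clear the block entry and invalidate old TLB entries.
	 * The 2M range is unmapped during this window, hence the
	 * single-CPU (or stop_machine()-like) assumption.
	 */
	pmd_clear(pmdp);
	flush_tlb_kernel_range(base, base + PMD_SIZE);

	/* 3. Make: install the new table entry. */
	__pmd_populate(pmdp, pte_phys, PMD_TYPE_TABLE);
}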