Subject: Re: [PATCH] arm64: mm: fix linear mapping mem access performance degradation
From: "guanghui.fgh" <guanghuifeng@linux.alibaba.com>
To: "Leizhen (ThunderTown)", Mike Rapoport
Cc: baolin.wang@linux.alibaba.com, catalin.marinas@arm.com, will@kernel.org, akpm@linux-foundation.org, david@redhat.com, jianyong.wu@arm.com, james.morse@arm.com, quic_qiancai@quicinc.com, christophe.leroy@csgroup.eu, jonathan@marek.ca, mark.rutland@arm.com, anshuman.khandual@arm.com, linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, geert+renesas@glider.be, ardb@kernel.org, linux-mm@kvack.org, bhe@redhat.com, Yao Hongbo
Date: Tue, 28 Jun 2022 15:52:48 +0800
Message-ID: <2e76694e-9ead-fc05-c8ad-01646ff02151@linux.alibaba.com>
In-Reply-To: <54f13945-fa35-247d-ca33-182931fd05ff@linux.alibaba.com>
References: <1656241815-28494-1-git-send-email-guanghuifeng@linux.alibaba.com> <4d18d303-aeed-0beb-a8a4-32893f2d438d@linux.alibaba.com> <32aefb80-c59c-74b6-c373-dd24edba0752@huawei.com> <075b0a8e-cb7e-70f6-b45a-54cd31886794@linux.alibaba.com> <55873a70-da46-b0f6-db81-841a2b5e886a@huawei.com> <54f13945-fa35-247d-ca33-182931fd05ff@linux.alibaba.com>
On 2022/6/28 11:06, guanghui.fgh wrote:
> Thanks.
>
> On 2022/6/28 9:34, Leizhen (ThunderTown) wrote:
>>
>>
>> On 2022/6/27 20:25, guanghui.fgh wrote:
>>> Thanks.
>>>
>>> On 2022/6/27 20:06, Leizhen (ThunderTown) wrote:
>>>>
>>>>
>>>> On 2022/6/27 18:46, guanghui.fgh wrote:
>>>>>
>>>>>
>>>>> On 2022/6/27 17:49, Mike Rapoport wrote:
>>>>>> Please don't post HTML.
>>>>>>
>>>>>> On Mon, Jun 27, 2022 at 05:24:10PM +0800, guanghui.fgh wrote:
>>>>>>> Thanks.
>>>>>>>
>>>>>>> On 2022/6/27 14:34, Mike Rapoport wrote:
>>>>>>>
>>>>>>>     On Sun, Jun 26, 2022 at 07:10:15PM +0800, Guanghui Feng wrote:
>>>>>>>
>>>>>>>         arm64 can build 2M/1G block/section mappings. When using a
>>>>>>>         DMA/DMA32 zone (crashkernel enabled, rodata full disabled,
>>>>>>>         kfence disabled), the mem_map will use non-block/section
>>>>>>>         mappings, because reserving the crashkernel requires
>>>>>>>         shrinking the region at page granularity. But this degrades
>>>>>>>         performance for large contiguous memory accesses in the
>>>>>>>         kernel (memcpy/memmove, etc.).
>>>>>>>
>>>>>>>         There are many related changes and discussions:
>>>>>>>         commit 031495635b46
>>>>>>>         commit 1a8e1cef7603
>>>>>>>         commit 8424ecdde7df
>>>>>>>         commit 0a30c53573b0
>>>>>>>         commit 2687275a5843
>>>>>>>
>>>>>>>     Please include a one-line summary of each commit. (See section
>>>>>>>     "Describe your changes" in
>>>>>>>     Documentation/process/submitting-patches.rst)
>>>>>>>
>>>>>>> OK, I will add one-line summaries to the git commit message.
>>>>>>>
>>>>>>>         This patch changes the mem_map to use block/section mappings
>>>>>>>         with a crashkernel. First, build block/section mappings
>>>>>>>         (normally 2M or 1G) for all available memory in the mem_map,
>>>>>>>         and reserve the crashkernel memory. Then walk the pagetable
>>>>>>>         to split the block/section mappings into non-block/section
>>>>>>>         mappings (normally 4K) [[[only]]] for the crashkernel memory.
>>>>>>>
>>>>>>>     This already happens when ZONE_DMA/ZONE_DMA32 are disabled.
>>>>>>>     Please explain why it is OK to change the way the memory is
>>>>>>>     mapped with ZONE_DMA/ZONE_DMA32 enabled.
>>>>>>>
>>>>>>> In short:
>>>>>>>
>>>>>>> 1. Build block/section mappings (normally 1G/2M) for all available
>>>>>>> memory, without inspecting the crashkernel region.
>>>>>>> 2. Reserve the crashkernel memory the same way as before.
>>>>>>> 3. Only change the crashkernel memory mapping to normal mappings
>>>>>>> (normally 4K).
>>>>>>> With this method, block/section mappings are used as much as
>>>>>>> possible.
>>>>>>
>>>>>> This does not answer the question of why changing the way the memory
>>>>>> is mapped when there is ZONE_DMA/DMA32 and a crashkernel won't cause
>>>>>> a regression.
>>>>>>
>>>>> 1. Quoting the comment in arch/arm64/mm/init.c:
>>>>>
>>>>> "Memory reservation for crash kernel either done early or deferred
>>>>> depending on DMA memory zones configs (ZONE_DMA) --
>>>>>
>>>>> In absence of ZONE_DMA configs arm64_dma_phys_limit initialized
>>>>> here instead of max_zone_phys().  This lets early reservation of
>>>>> crash kernel memory which has a dependency on arm64_dma_phys_limit.
>>>>> Reserving memory early for crash kernel allows linear creation of block
>>>>> mappings (greater than page-granularity) for all the memory bank ranges.
>>>>> In this scheme a comparatively quicker boot is observed.
>>>>>
>>>>> If ZONE_DMA configs are defined, crash kernel memory reservation
>>>>> is delayed until DMA zone memory range size initialization performed in
>>>>> zone_sizes_init().  The defer is necessary to steer clear of DMA zone
>>>>> memory range to avoid overlap allocation.  So crash kernel memory
>>>>> boundaries are not known when mapping all bank memory ranges, which
>>>>> otherwise means not possible to exclude crash kernel range from
>>>>> creating block mappings so page-granularity mappings are created
>>>>> for the entire memory range."
>>>>>
>>>>> Namely, the init order is: memblock init ---> linear memory mapping
>>>>> (4K mappings for the crashkernel, which requires page-granularity
>>>>> changes) ---> zone DMA limit ---> reserve crashkernel.
>>>>> So when ZONE_DMA is enabled and a crashkernel is used, the linear
>>>>> mapping uses 4K mappings.
>>>>>
>>>>> 2. As mentioned above, when the linear mapping simply uses 4K
>>>>> mappings, there are many dTLB misses (degrading performance).
>>>>> This patch uses block/section mappings as far as possible, which
>>>>> improves performance.
>>>>>
>>>>> 3. This patch reserves the crashkernel the same way as before (the
>>>>> ZONE_DMA and crashkernel reservation order is unchanged), and only
>>>>> changes the linear memory mapping to block/section mappings.
>>>>> .
>>>>>
>>>> I think Mike Rapoport is probably asking you to answer whether you've
>>>> taken things such as BBM (break-before-make) into account. For
>>>> example, in the following code: we should prepare the next-level
>>>> pgtable first, then change the 2M block mapping to 4K page mappings,
>>>> and flush the TLB at the end.
>>>> +static void init_crashkernel_pmd(pud_t *pudp, unsigned long addr,
>>>> +                 unsigned long end, phys_addr_t phys,
>>>> +                 pgprot_t prot,
>>>> +                 phys_addr_t (*pgtable_alloc)(int), int flags)
>>>> +{
>>>> +    phys_addr_t map_offset;
>>>> +    unsigned long next;
>>>> +    pmd_t *pmdp;
>>>> +    pmdval_t pmdval;
>>>> +
>>>> +    pmdp = pmd_offset(pudp, addr);
>>>> +    do {
>>>> +        next = pmd_addr_end(addr, end);
>>>> +        if (!pmd_none(*pmdp) && pmd_sect(*pmdp)) {
>>>> +            phys_addr_t pte_phys = pgtable_alloc(PAGE_SHIFT);
>>>> +            pmd_clear(pmdp);
>>>> +            pmdval = PMD_TYPE_TABLE | PMD_TABLE_UXN;
>>>> +            if (flags & NO_EXEC_MAPPINGS)
>>>> +                pmdval |= PMD_TABLE_PXN;
>>>> +            __pmd_populate(pmdp, pte_phys, pmdval);
>>>> +            flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
>>>>
>>>> The pgtable is empty now. However, memory other than the crashkernel
>>>> may be being accessed.
>>> 1. When reserving the crashkernel and remapping the linear mapping,
>>> only the boot CPU is running. No other CPU/thread is running at the
>>> same time.
>>
>> So, put this in a code comment?
> OK.
>>
>> If scalability is considered, unpredictable changes may occur in the
>> future; for example, other modules may also need this mapping function.
>> It would be better to deal with BBM now, and make this function public.
> OK, could you give me some advice?
>>
>>
>>>
>>> 2. When clearing the block/section mapping, I flush the TLB with
>>> flush_tlb_kernel_range(). Afterwards I rebuild the 4K mapping (I think
>>> there is no need to flush the TLB again).
>>
>>
>>>
>>>>
>>>> +
>>>> +            map_offset = addr - (addr & PMD_MASK);
>>>> +            if (map_offset)
>>>> +                alloc_init_cont_pte(pmdp, addr & PMD_MASK, addr,
>>>> +                        phys - map_offset, prot,
>>>> +                        pgtable_alloc, flags);
>>>> +
>>>> +            if (next < (addr & PMD_MASK) + PMD_SIZE)
>>>> +                alloc_init_cont_pte(pmdp, next, (addr & PUD_MASK) +
>>>> +                        PUD_SIZE, next - addr + phys,
>>>> +                        prot, pgtable_alloc, flags);
>>
>> Here and in alloc_crashkernel_pud(), the raw flags should be used.
>> They may not contain (NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS).
> Yes, the memory outside the crashkernel should use block/section
> mappings as far as possible, including the left and right margins.
> But I have tested this on a HiSilicon Kunpeng 920-6426 and got a
> performance degradation (without the NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS
> flags for the left/right margins).
> It's strange; could you give some advice? Maybe it works well on arm
> platforms other than the HiSilicon Kunpeng 920-6426.

We should split the non-crashkernel memory [[[ without ]]] the
NO_BLOCK_MAPPINGS/NO_CONT_MAPPINGS flags.

I have tested it on other arm platforms [[[ non-HiSilicon arm
platforms ]]] and also got a great performance improvement.

Could you help me check the difference between the HiSilicon Kunpeng
920-6426 and other arm platforms regarding block/section mapping TLB
support?
>>
>>>> +        }
>>>> +        alloc_crashkernel_cont_pte(pmdp, addr, next, phys, prot,
>>>> +                       pgtable_alloc, flags);
>>>> +        phys += next - addr;
>>>> +    } while (pmdp++, addr = next, addr != end);
>>>> +}
>>>>
>>> .
>>>
>>
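
P.S. To make the BBM question above concrete, here is a minimal sketch
of the prepare/break/make ordering Leizhen describes. It is illustrative
only, not code from the patch under review: split_pmd_block() and its
prot parameter are hypothetical names, the table descriptor is
simplified (no PXN/UXN handling), and it assumes nothing accesses the
2M range while it is briefly unmapped (e.g. a single boot CPU, as
discussed above).

#include <linux/mm.h>
#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/*
 * Split one 2M block entry into a 4K PTE table, break-before-make:
 * fully populate the new table first, then clear the block entry and
 * flush the TLB, and only then install the table entry, so a page
 * table walker never sees a half-filled table.
 */
static void split_pmd_block(pmd_t *pmdp, unsigned long addr, pgprot_t prot,
			    phys_addr_t (*pgtable_alloc)(int))
{
	phys_addr_t pte_phys = pgtable_alloc(PAGE_SHIFT);
	pte_t *ptep = (pte_t *)__va(pte_phys);
	phys_addr_t phys = __pmd_to_phys(*pmdp);	/* block base */
	unsigned long base = addr & PMD_MASK;
	int i;

	/* 1. Prepare: fill the new PTE table before it becomes visible. */
	for (i = 0; i < PTRS_PER_PTE; i++)
		set_pte(ptep + i,
			pfn_pte(__phys_to_pfn(phys + i * PAGE_SIZE), prot));

	/*
	 * 2. Break: clear the block entry and invalidate old TLB entries.
	 * The 2M range is unmapped during this window, hence the
	 * single-CPU (or stop_machine()-like) assumption.
	 */
	pmd_clear(pmdp);
	flush_tlb_kernel_range(base, base + PMD_SIZE);

	/* 3. Make: install the new table entry. */
	__pmd_populate(pmdp, pte_phys, PMD_TYPE_TABLE);
}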