Subject: Re: [PATCH] arm64: mm: hugetlb: add support for free vmemmap pages of HugeTLB
From: Anshuman Khandual <anshuman.khandual@arm.com>
To: David Hildenbrand, Muchun Song, will@kernel.org, akpm@linux-foundation.org, bodeddub@amazon.com, osalvador@suse.de, mike.kravetz@oracle.com, rientjes@google.com
Cc: linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, duanxiongchun@bytedance.com, fam.zheng@bytedance.com, zhengqi.arch@bytedance.com, Mark Rutland, Catalin Marinas, James Morse
Date: Fri, 21 May 2021 10:32:41 +0530
References: <20210518091826.36937-1-songmuchun@bytedance.com> <1b9d008a-7544-cc85-5c2f-532b984eb5b5@arm.com> <88114091-fbb2-340d-b69b-a572fa340265@redhat.com> <45c1a368-3d31-e92d-f120-4dca0eb2111d@redhat.com>
In-Reply-To: <45c1a368-3d31-e92d-f120-4dca0eb2111d@redhat.com>

On 5/20/21 5:29 PM, David Hildenbrand wrote:
> On 20.05.21 13:54, Anshuman
Khandual wrote:
>>
>> On 5/19/21 5:33 PM, David Hildenbrand wrote:
>>> On 19.05.21 13:45, Anshuman Khandual wrote:
>>>>
>>>> On 5/18/21 2:48 PM, Muchun Song wrote:
>>>>> The preparation of supporting freeing vmemmap associated with each
>>>>> HugeTLB page is ready, so we can support this feature for arm64.
>>>>>
>>>>> Signed-off-by: Muchun Song
>>>>> ---
>>>>>   arch/arm64/mm/mmu.c | 5 +++++
>>>>>   fs/Kconfig          | 2 +-
>>>>>   2 files changed, 6 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>>>>> index 5d37e461c41f..967b01ce468d 100644
>>>>> --- a/arch/arm64/mm/mmu.c
>>>>> +++ b/arch/arm64/mm/mmu.c
>>>>> @@ -23,6 +23,7 @@
>>>>>   #include
>>>>>   #include
>>>>>   #include
>>>>> +#include
>>>>>
>>>>>   #include
>>>>>   #include
>>>>> @@ -1134,6 +1135,10 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
>>>>>       pmd_t *pmdp;
>>>>>
>>>>>       WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>>>> +
>>>>> +    if (is_hugetlb_free_vmemmap_enabled() && !altmap)
>>>>> +        return vmemmap_populate_basepages(start, end, node, altmap);
>>>>
>>>> Not considering the fact that this will force the kernel to have only
>>>> base page size mappings for vmemmap (unless altmap is also requested),
>>>> which might reduce performance, it also enables vmemmap mappings to
>>>> be torn down or built up at runtime, which could potentially collide
>>>> with other kernel page table walkers like ptdump or a memory hotremove
>>>> operation! How are those possible collisions protected against right now?
>>>
>>> Hi Anshuman,
>>>
>>> Memory hotremove is not an issue IIRC.
At the time memory is removed, all huge pages either have been migrated away or dissolved; the vmemmap is stable.
>>
>> But what happens when a hot remove section's vmemmap area (which is being
>> torn down) is nearby another vmemmap area which is either being created or
>> destroyed for HugeTLB alloc/free purposes? As you mentioned, HugeTLB
>> pages inside the hot remove section might be safe. But what about other
>> HugeTLB areas whose vmemmap area shares page table entries with vmemmap
>> entries for a section being hot removed? A massive HugeTLB alloc/use/free
>> test cycle using memory just adjacent to a memory hotplug area, which is
>> always added and removed periodically, should be able to expose this problem.
>>
>> IIUC, unlike vmalloc(), vmemmap mapping areas in the kernel page table were
>> always constant unless there are hotplug add or remove operations, which
>> are protected with a hotplug lock. Now with this change, we could have
>> simultaneous walking and add or remove of the vmemmap areas without any
>> synchronization. Isn't this problematic?
>>
>> On arm64, the memory hot remove operation empties free portions of the vmemmap
>> table after clearing them. Hence all concurrent walkers (hugetlb_vmemmap,
>> hot remove, ptdump etc.) need to be synchronized against hot remove.
>>
>> From arch/arm64/mm/mmu.c
>>
>> void vmemmap_free(unsigned long start, unsigned long end,
>>                   struct vmem_altmap *altmap)
>> {
>> #ifdef CONFIG_MEMORY_HOTPLUG
>>         WARN_ON((start < VMEMMAP_START) || (end > VMEMMAP_END));
>>
>>         unmap_hotplug_range(start, end, true, altmap);
>>         free_empty_tables(start, end, VMEMMAP_START, VMEMMAP_END);
>> #endif
>> }
>
> You are right, however, AFAIR
>
> 1) We always populate base pages, meaning we only modify PTEs and do not actually add/remove page tables when creating/destroying a hugetlb page. Page table walkers should be fine and not suddenly run into a use-after-free.
>
> 2) For pfn_to_page() users to never fault, we have to do an atomic exchange of PTEs, meaning someone traversing a page table looking for pte_none() entries (like free_empty_tables() in your example) should never get a false positive.
>
> Makes sense, or am I missing something?
>
>>
>>>
>>> vmemmap access (accessing the memmap via a virtual address) itself is not an issue. Manually walking (vmemmap) page tables might behave
>>
>> Right.
>>
>>> differently, not sure if ptdump would require any synchronization.
>>
>> Dumping a wrong value is probably okay, but crashing because a page table
>> entry is being freed after ptdump acquired the pointer is bad. On arm64,
>> ptdump() is protected against hotremove via [get|put]_online_mems().
>
> Okay, and as the feature in question only exchanges PTEs, we should be fine.

Adding Mark, Catalin and James here in case I might have missed
something regarding possible vmemmap collision.