Date: Fri, 26 Jul 2024 10:31:03 +0800
From: Baoquan He <bhe@redhat.com>
To: Hailong Liu
Cc: Barry Song <21cnbao@gmail.com>, Andrew Morton, Uladzislau Rezki,
    Christoph Hellwig, Lorenzo Stoakes, Vlastimil Babka, Michal Hocko,
    Matthew Wilcox, Tangquan Zheng, linux-mm@kvack.org,
    linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
References: <20240725035318.471-1-hailong.liu@oppo.com> <20240725164003.ft6huabwa5dqoy2g@oppo.com>
In-Reply-To: <20240725164003.ft6huabwa5dqoy2g@oppo.com>
On 07/26/24 at 12:40am, Hailong Liu wrote:
> On Thu, 25. Jul 19:39, Baoquan He wrote:
> > On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> > > From: "Hailong.Liu"
> > >
> > > The scenario where the issue occurs is as follows:
> > > CONFIG: vmap_allow_huge = true && 2M is for PMD_SIZE
> > > kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> > >     __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> > >         vm_area_alloc_pages(order=9) ---> order-9 alloc fails and falls back
> > >                                           to order-0, and phys_addr is
> > >                                           aligned with PMD_SIZE
> > >     vmap_pages_range
> > >         vmap_pages_range_noflush
> > >             __vmap_pages_range_noflush(page_shift = 21) ----> incorrect *huge* vmap here
> > >
> > > In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> > > might be issues with __vmap_pages_range_noflush().
> > >
> > > The patch also removes VM_ALLOW_HUGE_VMAP in kvmalloc_node(). There
> > > are several reasons for this:
> > > - It increases the memory footprint because of alignment.
> > > - It increases the likelihood of kvmalloc allocation failures.
> > > - Removing it fixes the original issue of kvmalloc with __GFP_NOFAIL
> > >   possibly returning NULL.
> > > Besides, if drivers want huge vmap mappings, they should use
> > > vmalloc_huge() instead.
> >
> > It seems there are two issues you are folding into one patch:
> Got it. I will separate them in the next version.
>
> > one is the wrong information passed into __vmap_pages_range_noflush();
> > the other is that you want to take off VM_ALLOW_HUGE_VMAP on kvmalloc().
> >
> > About the 1st one, does the draft below look OK to you?
> >
> > Pass out the fallback order and adjust the order and shift for later
> > usage, mainly for vmap_pages_range().
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 260897b21b11..5ee9ae518f3d 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
> >
> >  static inline unsigned int
> >  vm_area_alloc_pages(gfp_t gfp, int nid,
> > -		unsigned int order, unsigned int nr_pages, struct page **pages)
> > +		unsigned int *page_order, unsigned int nr_pages, struct page **pages)
> >  {
> > -	unsigned int nr_allocated = 0;
> > +	unsigned int nr_allocated = 0, order = *page_order;
> >  	gfp_t alloc_gfp = gfp;
> >  	bool nofail = gfp & __GFP_NOFAIL;
> >  	struct page *page;
> > @@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> >  		cond_resched();
> >  		nr_allocated += 1U << order;
> >  	}
> > +	*page_order = order;
> >
> >  	return nr_allocated;
> >  }
> > @@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  	page_order = vm_area_page_order(area);
> >
> >  	area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
> > -			node, page_order, nr_small_pages, area->pages);
> > +			node, &page_order, nr_small_pages, area->pages);
> >
> >  	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> >  	if (gfp_mask & __GFP_ACCOUNT) {
> > @@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> >  		goto fail;
> >  	}
> >
> > +
> > +	set_vm_area_page_order(area, page_order);
> > +	page_shift = page_order + PAGE_SHIFT;
> > +
> >  	/*
> >  	 * page tables allocations ignore external gfp mask, enforce it
> >  	 * by the scope API
>
> The logic of this patch is somewhat similar to my first one. If the
> high-order allocation fails, it falls back to normal mapping.
>
> However, I also save the fallback position. The pages before that
> position are used for huge mapping, and the ones at or after it for
> normal mapping, as Barry said: "support the combination of PMD and PTE
> mapping". That will take some time, as it needs to address the corner
> cases and needs some testing.
Hmm, we may not need to worry about the imperfect mapping. Currently
there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
and vmalloc_huge(). For vmalloc_huge(), it's called in the three places
below, which are all invoked during boot, so they can basically succeed
in getting the required contiguous physical memory. I guess that's why
Tangquan only spotted this issue on the kvmalloc invocation when the
required size exceeds e.g. 2M. For kvmalloc_node(), we have said in the
code comment above __kvmalloc_node_noprof() that it's a best-effort
behaviour.

mm/mm_init.c
    table = vmalloc_huge(size, gfp_flags);
net/ipv4/inet_hashtables.c
    new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
net/ipv4/udp.c
    udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)

Maybe we should add a code comment or document to tell people that
contiguous physical pages are not guaranteed for vmalloc_huge() if it is
used after boot.

> IMO, the draft can fix the current issue, and it does not have
> significant side effects. Barry, what do you think about this patch?
> If you think it's okay, I will split it into two: one to remove
> VM_ALLOW_HUGE_VMAP and the other to address the current mapping issue.
>
> --
> help you, help me,
> Hailong.