Date: Fri, 26 Jul 2024 17:29:26 +0800
From: Baoquan He
To: Hailong Liu
Cc: Barry Song <21cnbao@gmail.com>, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, Lorenzo Stoakes, Vlastimil Babka, Michal Hocko,
	Matthew Wilcox, Tangquan Zheng, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
In-Reply-To: <20240726084809.gdz2axvawwwekpu6@oppo.com>

On 07/26/24 at 04:48pm, Hailong Liu wrote:
> On Fri, 26. Jul 16:37, Baoquan He wrote:
> > On 07/26/24 at 05:29pm, Barry Song wrote:
> > > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu wrote:
> > > >
> > > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > > [...]
> > > > > > > The logic of this patch is somewhat similar to my first one. If high order
> > > > > > > allocation fails, it will go normal mapping.
> > > > > > >
> > > > > > > However I also save the fallback position. The ones before this position are
> > > > > > > used for huge mapping, the ones >= position for normal mapping as Barry said.
> > > > > > > "support the combination of PMD and PTE mapping". This will take some
> > > > > > > time as it needs to address the corner cases and do some tests.
> > > > > >
> > > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > > and vmalloc_huge().
> > > > > >
> > > > > > For vmalloc_huge(), it's called in below three interfaces which are all
> > > > > > invoked during boot. Basically they can succeed to get required contiguous
> > > > > > physical memory. I guess that's why Tangquan only spot this issue on kvmalloc
> > > > > > invocation when the required size exceeds e.g. 2M. For kvmalloc_node(),
> > > > > > we have told that in the code comment above __kvmalloc_node_noprof(),
> > > > > > it's a best effort behaviour.
> > > > > >
> > > > > Take a __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as an example.
> > > > > Because of the align requirement of huge, the real size is 4M.
> > > > > If the first order-9 allocation succeeds and the next one fails, then because
> > > > > of the fallback, the layout of pages would be like order9 - 512 * order0.
> > > > > order9 supports huge mapping, but order0 does not.
> > > > > With the patch above, it would call vmap_small_pages_range_noflush() and do
> > > > > normal mapping, the huge mapping would not exist.
> > > > >
> > > > > > mm/mm_init.c <>
> > > > > >	table = vmalloc_huge(size, gfp_flags);
> > > > > > net/ipv4/inet_hashtables.c <>
> > > > > >	new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > > > net/ipv4/udp.c <>
> > > > > >	udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > > >
> > > > > > Maybe we should add code comment or document to notice people that the
> > > > > > contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > > use it after boot.
> > > > > >
> > > > > > >
> > > > > > > IMO, the draft can fix the current issue, it also does not have significant side
> > > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > > other to address the current mapping issue.
> > > > > > >
> > > > > > > --
> > > > > > > help you, help me,
> > > > > > > Hailong.
> > > > > > >
> > > > > >
> > > > > >
> > > > I checked the code, the issue only happens in gfp_mask with __GFP_NOFAIL and
> > > > fallback to order 0. Actually, without this commit
> > > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
> > > > if __vmalloc_area_node allocation failed, it will goto fail and try order-0.
> > > >
> > > > fail:
> > > >	if (shift > PAGE_SHIFT) {
> > > >		shift = PAGE_SHIFT;
> > > >		align = real_align;
> > > >		size = real_size;
> > > >		goto again;
> > > >	}
> > > >
> > > > So do we really need fallback to order-0 if nofail?
> > >
> > > Good catch, this is what I missed. I feel we can revert Michal's fix.
> > > And just remove __GFP_NOFAIL bit when we are still allocating
> > > by high-order. When "goto again" happens, we will allocate by
> > > order-0, in this case, we keep the __GFP_NOFAIL.
> >
> > With Michal's patch, the fallback will be able to satisfy the allocation
> > for nofail case because it fallback to 0-order plus __GFP_NOFAIL. The
> Hi Baoquan:
>
> int __vmap_pages_range_noflush(unsigned long addr, unsigned long end,
>		pgprot_t prot, struct page **pages, unsigned int page_shift)
> {
>	unsigned int i, nr = (end - addr) >> PAGE_SHIFT;
>
>	WARN_ON(page_shift < PAGE_SHIFT);
>
>	if (!IS_ENABLED(CONFIG_HAVE_ARCH_HUGE_VMALLOC) ||
>			page_shift == PAGE_SHIFT)
>		return vmap_small_pages_range_noflush(addr, end, prot, pages);
>
>	for (i = 0; i < nr; i += 1U << (page_shift - PAGE_SHIFT)) { ---> huge mapping
>		int err;
>
>		err = vmap_range_noflush(addr, addr + (1UL << page_shift),
>					page_to_phys(pages[i]), prot, ---------> incorrect mapping would occur here if nofail and fallback to order0

Thanks. I got this issue from your patch. What I mean is that, even if we
adjust page_shift and page_order after the fallback as in the draft patch I
proposed, Barry still mentioned the nofail issue, and that confuses me.

>					page_shift);
>		if (err)
>			return err;
>
>		addr += 1UL << page_shift;
>	}
>
>	return 0;
> }
> > 'if (shift > PAGE_SHIFT)' conditional checking and handling may be
> > problematic since it could jump to fail because vmap_pages_range()
> > invocation failed, or partially allocate huge pages and break down,
> > then it will ignore the already allocated pages, and do all the thing again.
> >
> > The only thing 'if (shift > PAGE_SHIFT)' checking and handling makes
> > sense is it fallback to the real_size and real_align. BUT we need handle
> > the fail separately, e.g.
> > 1) __get_vm_area_node() failed;
> > 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> > 3) vmap_pages_range() failed;
> >
> > Honestly, I didn't see where the nofail is mishandled, could you point
> > it out specifically? I could miss it.
> >
> > Thanks
> > Baoquan
> >
>
> --
> help you, help me,
> Hailong.
>
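
For reference, the mismatch discussed in this thread can be illustrated with a small userspace model. This is only a sketch, not kernel code: count_mismapped(), fill_example() and the page frame numbers are hypothetical, and it assumes 4K base pages with 2M huge mappings. It models the 4M example from the thread (1024 page slots): the first 2M is backed by one order-9 block, the second 2M either by another order-9 block or, after the fallback, by 512 scattered order-0 pages. The model walks the array in huge-page steps and trusts only the first page of each step, which is the assumption the quoted loop makes.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12u	/* assumption: 4 KiB base pages */
#define PMD_SHIFT  21u	/* assumption: 2 MiB huge mappings */

/*
 * Hypothetical model: pfns[i] is the page frame backing the i-th PAGE_SIZE
 * slot of the area, similar in spirit to the pages[] array in the thread.
 * The range is walked in (1 << page_shift)-sized steps and each step is
 * treated as physically contiguous starting at pfns[i]. Returns the number
 * of slots that would end up pointing at the wrong frame.
 */
static size_t count_mismapped(const unsigned long *pfns, size_t nr,
			      unsigned int page_shift)
{
	size_t step, bad = 0;

	assert(page_shift >= PAGE_SHIFT);
	step = (size_t)1 << (page_shift - PAGE_SHIFT);

	for (size_t i = 0; i < nr; i += step)
		for (size_t j = 0; j < step && i + j < nr; j++)
			if (pfns[i] + j != pfns[i + j])
				bad++;
	return bad;
}

/*
 * Fill 1024 slots (4M): the first 512 come from one contiguous order-9
 * block; the last 512 are either a second contiguous block or, when
 * fallback is non-zero, scattered order-0 pages. All frame numbers are
 * made up for illustration.
 */
static void fill_example(unsigned long *pfns, int fallback)
{
	for (size_t i = 0; i < 512; i++)
		pfns[i] = 0x100000ul + i;
	for (size_t i = 0; i < 512; i++)
		pfns[512 + i] = fallback ? 0x300000ul + 2 * i
					 : 0x200000ul + i;
}
```

In this model, the fallback layout walked with PMD_SHIFT reports 511 wrong slots (every page of the second 2M except its first), while the same layout walked with PAGE_SHIFT reports none, as does the layout with two order-9 blocks walked with PMD_SHIFT.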