From mboxrd@z Thu Jan 1 00:00:00 1970
From: Barry Song <21cnbao@gmail.com>
Date: Fri, 26 Jul 2024 21:15:36 +1200
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
To: Baoquan He
Cc: Hailong Liu, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
 Lorenzo Stoakes, Vlastimil Babka, Michal Hocko, Matthew Wilcox,
 Tangquan Zheng, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240725035318.471-1-hailong.liu@oppo.com>
 <20240725164003.ft6huabwa5dqoy2g@oppo.com>
 <20240726040052.hs2gvpktrnlbvhsq@oppo.com>
 <20240726050356.ludmpxfee6erlxxt@oppo.com>
Content-Type: text/plain; charset="UTF-8"
On Fri, Jul 26, 2024 at 8:38 PM Baoquan He wrote:
>
> On 07/26/24 at 05:29pm, Barry Song wrote:
> > On Fri, Jul 26, 2024 at 5:04 PM Hailong Liu wrote:
> > >
> > > On Fri, 26. Jul 12:00, Hailong Liu wrote:
> > > > On Fri, 26. Jul 10:31, Baoquan He wrote:
> > > > [...]
> > > > > > The logic of this patch is somewhat similar to my first one.
> > > > > > If high order allocation fails, it will go to normal mapping.
> > > > > >
> > > > > > However, I also save the fallback position. The ones before this position are
> > > > > > used for huge mapping, the ones >= the position for normal mapping, as Barry said:
> > > > > > "support the combination of PMD and PTE mapping". This will take some
> > > > > > time, as it needs to address the corner cases and do some tests.
> > > > >
> > > > > Hmm, we may not need to worry about the imperfect mapping. Currently
> > > > > there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> > > > > and vmalloc_huge().
> > > > >
> > > > > For vmalloc_huge(), it's called in the three interfaces below, which are all
> > > > > invoked during boot. Basically they can succeed in getting the required contiguous
> > > > > physical memory. I guess that's why Tangquan only spotted this issue on a kvmalloc
> > > > > invocation when the required size exceeds e.g. 2M. For kvmalloc_node(),
> > > > > we have said in the code comment above __kvmalloc_node_noprof() that
> > > > > it's a best-effort behaviour.
> > > >
> > > > Take __vmalloc_node_range(2.1M, VM_ALLOW_HUGE_VMAP) as an example.
> > > > Because of the alignment requirement of huge mapping, the real size is 4M.
> > > > If the first order-9 allocation succeeds and the next one fails, then because of the
> > > > fallback the layout of pages would be: 1 x order-9 + 512 x order-0.
> > > > order-9 supports huge mapping, but order-0 does not.
> > > > With the patch above, it would call vmap_small_pages_range_noflush() and do normal
> > > > mapping, so the huge mapping would not exist.
> > > > >
> > > > > mm/mm_init.c <>
> > > > >   table = vmalloc_huge(size, gfp_flags);
> > > > > net/ipv4/inet_hashtables.c <>
> > > > >   new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> > > > > net/ipv4/udp.c <>
> > > > >   udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
> > > > >
> > > > > Maybe we should add a code comment or document to notify people that
> > > > > the contiguous physical pages are not guaranteed for vmalloc_huge() if you
> > > > > use it after boot.
> > > > >
> > > > > >
> > > > > > IMO, the draft can fix the current issue, and it does not have significant side
> > > > > > effects. Barry, what do you think about this patch? If you think it's okay,
> > > > > > I will split this patch into two: one to remove the VM_ALLOW_HUGE_VMAP and the
> > > > > > other to address the current mapping issue.
> > > > > >
> > > > > > --
> > > > > > help you, help me,
> > > > > > Hailong.
> > >
> > > I checked the code: the issue only happens when gfp_mask contains __GFP_NOFAIL and
> > > we fall back to order 0. Actually, without commit
> > > e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"),
> > > if the __vmalloc_area_node() allocation failed, it would goto fail and try order-0:
> > >
> > > fail:
> > >         if (shift > PAGE_SHIFT) {
> > >                 shift = PAGE_SHIFT;
> > >                 align = real_align;
> > >                 size = real_size;
> > >                 goto again;
> > >         }
> > >
> > > So do we really need the fallback to order-0 if nofail?
> >
> > Good catch, this is what I missed. I feel we can revert Michal's fix
> > and just remove the __GFP_NOFAIL bit while we are still allocating
> > at high order. When "goto again" happens, we will allocate at
> > order-0, and in that case we keep __GFP_NOFAIL.
>
> With Michal's patch, the fallback will be able to satisfy the allocation
> for the nofail case because it falls back to 0-order plus __GFP_NOFAIL.
> The

My point is that vm_area_alloc_pages() is an internal function and just an
implementation detail. As long as its caller, __vmalloc_area_node(), can
support NOFAIL, it's fine. Therefore, we can skip NOFAIL support for
high-order allocations in vm_area_alloc_pages() and limit __GFP_NOFAIL
support to order-0.

The good news is that __vmalloc_node_range_noprof() already has a way to
fall back to order-0:

fail:
        if (shift > PAGE_SHIFT) {
                shift = PAGE_SHIFT;
                align = real_align;
                size = real_size;
                goto again;
        }

So we can definitely utilize this fallback instead of implementing it
within vm_area_alloc_pages(), which would alter the page_order of the
vm_area and create an inconsistency, crashing the system due to memory
corruption. With the higher-level fallback in __vmalloc_node_range_noprof(),
we won't need an unusual fix-up for the vm_area as you're proposing; the
page_order of the vm_area will consistently stay the same.

If there is any room for improvement, we may also add:

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index caf032f0bd69..03d8148d7a02 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3806,7 +3806,7 @@ void *__vmalloc_node_range_noprof(unsigned long size, unsigned long align,
 		warn_alloc(gfp_mask, NULL,
 			"vmalloc error: size %lu, vm_struct allocation failed%s",
 			real_size, (nofail) ? ". Retrying." : "");
-		if (nofail) {
+		if (nofail && shift == PAGE_SHIFT) {
 			schedule_timeout_uninterruptible(1);
 			goto again;
 		}

There's no need to keep retrying for __get_vm_area_node() until success,
as we will succeed once we fall back to order-0.

> 'if (shift > PAGE_SHIFT)' conditional checking and handling may be
> problematic, since it could jump to fail because the vmap_pages_range()
> invocation failed, or partially allocate huge pages and break down;
> then it will ignore the already allocated pages and do the whole thing again.
>
> The only thing the 'if (shift > PAGE_SHIFT)' checking and handling makes
> sense for is falling back to real_size and real_align.
> BUT we need to handle the failures separately, e.g.:
> 1) __get_vm_area_node() failed;
> 2) vm_area_alloc_pages() failed when shift > PAGE_SHIFT and non-nofail;
> 3) vmap_pages_range() failed;
>
> Honestly, I didn't see where the nofail is mishandled; could you point
> it out specifically? I could have missed it.
>
> Thanks
> Baoquan

Thanks
Barry