From: Barry Song <21cnbao@gmail.com>
Date: Fri, 26 Jul 2024 15:53:18 +1200
Subject: Re: [RFC PATCH v2] mm/vmalloc: fix incorrect __vmap_pages_range_noflush() if vm_area_alloc_pages() from high order fallback to order0
To: Baoquan He
Cc: Hailong Liu, Andrew Morton, Uladzislau Rezki, Christoph Hellwig,
 Lorenzo Stoakes, Vlastimil Babka, Michal Hocko, Matthew Wilcox,
 Tangquan Zheng, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20240725035318.471-1-hailong.liu@oppo.com> <20240725164003.ft6huabwa5dqoy2g@oppo.com>

On Fri, Jul 26, 2024 at 2:31 PM Baoquan He wrote:
>
> On 07/26/24 at 12:40am, Hailong Liu wrote:
> > On Thu, 25 Jul 19:39, Baoquan He wrote:
> > > On 07/25/24 at 11:53am, hailong.liu@oppo.com wrote:
> > > > From: "Hailong.Liu"
> > > >
> > > > The scenario where the issue occurs is as follows:
> > > > CONFIG: vmap_allow_huge == true && PMD_SIZE == 2M
> > > > kvmalloc(2M, __GFP_NOFAIL|GFP_XXX)
> > > >   __vmalloc_node_range(vm_flags=VM_ALLOW_HUGE_VMAP)
> > > >     vm_area_alloc_pages(order=9) ---> order-9 allocation fails and
> > > >                                       falls back to order-0, and
> > > >                                       phys_addr happens to be
> > > >                                       PMD_SIZE-aligned
> > > >     vmap_pages_range
> > > >       vmap_pages_range_noflush
> > > >         __vmap_pages_range_noflush(page_shift = 21) ----> incorrectly
> > > >                                                           vmaps *huge* here
> > > >
> > > > In fact, as long as page_shift is not equal to PAGE_SHIFT, there
> > > > might be issues with __vmap_pages_range_noflush().
> > > >
> > > > The patch also removes VM_ALLOW_HUGE_VMAP from kvmalloc_node().
> > > > There are several reasons for this:
> > > > - It increases the memory footprint because of the alignment.
> > > > - It increases the likelihood of kvmalloc allocation failures.
> > > > - It fixes the original issue of kvmalloc with __GFP_NOFAIL
> > > >   possibly returning NULL.
> > > > Besides, if drivers want huge vmap mappings, they should use
> > > > vmalloc_huge() instead.
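To make the breakage concrete: after the fallback, area->pages holds
scattered order-0 pages, but page_shift is still 21, so the mapping path
strides in PMD-sized steps and assumes each step is physically contiguous.
Below is a minimal userspace sketch of that arithmetic -- illustrative
only, not the kernel code; the hypothetical map_range() merely stands in
for the kernel's vmap_range_noflush():

#include <stdio.h>

#define PAGE_SHIFT 12

/* stands in for vmap_range_noflush(): pretend to map 1 << shift bytes */
static void map_range(unsigned long va, unsigned long pa, unsigned int shift)
{
	printf("map va 0x%lx -> pa 0x%lx, %lu KiB\n",
	       va, pa, (1UL << shift) >> 10);
}

int main(void)
{
	/* four scattered order-0 pages, as left behind by the fallback */
	unsigned long pa[4] = { 0x1000, 0x9000, 0x3000, 0x7000 };
	unsigned long va = 0x100000000UL;
	unsigned int page_shift = 21;	/* stale: still the PMD-sized shift */
	unsigned int step = 1U << (page_shift - PAGE_SHIFT);	/* 512 pages */
	unsigned int i;

	/*
	 * Only pa[i] is consulted per step; the following 511 pages are
	 * assumed to sit physically contiguous behind it, which scattered
	 * order-0 pages do not.
	 */
	for (i = 0; i < 4; i += step)
		map_range(va + ((unsigned long)i << PAGE_SHIFT),
			  pa[i], page_shift);

	return 0;
}

Only the first page of each 2 MiB step is ever looked at; the other 511
pages are taken on faith, which is exactly what the order-0 fallback no
longer guarantees.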
> > >
> > > Seems there are two issues you are folding into one patch:
> > Got it. I will separate them in the next version.
> >
> > > one is the wrong information passed into __vmap_pages_range_noflush();
> > > the other is that you want to take VM_ALLOW_HUGE_VMAP off kvmalloc().
> > >
> > > About the 1st one, do you think the draft below is OK to you?
> > >
> > > Pass out the fallback order and adjust the order and shift for later
> > > usage, mainly for vmap_pages_range().
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 260897b21b11..5ee9ae518f3d 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -3508,9 +3508,9 @@ EXPORT_SYMBOL_GPL(vmap_pfn);
> > >
> > >  static inline unsigned int
> > >  vm_area_alloc_pages(gfp_t gfp, int nid,
> > > -		unsigned int order, unsigned int nr_pages, struct page **pages)
> > > +		unsigned int *page_order, unsigned int nr_pages, struct page **pages)
> > >  {
> > > -	unsigned int nr_allocated = 0;
> > > +	unsigned int nr_allocated = 0, order = *page_order;
> > >  	gfp_t alloc_gfp = gfp;
> > >  	bool nofail = gfp & __GFP_NOFAIL;
> > >  	struct page *page;
> > > @@ -3611,6 +3611,7 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
> > >  		cond_resched();
> > >  		nr_allocated += 1U << order;
> > >  	}
> > > +	*page_order = order;
> > >
> > >  	return nr_allocated;
> > >  }
> > > @@ -3654,7 +3655,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >  	page_order = vm_area_page_order(area);
> > >
> > >  	area->nr_pages = vm_area_alloc_pages(gfp_mask | __GFP_NOWARN,
> > > -		node, page_order, nr_small_pages, area->pages);
> > > +		node, &page_order, nr_small_pages, area->pages);
> > >
> > >  	atomic_long_add(area->nr_pages, &nr_vmalloc_pages);
> > >  	if (gfp_mask & __GFP_ACCOUNT) {
> > > @@ -3686,6 +3687,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > >  		goto fail;
> > >  	}
> > >
> > > +
> > > +	set_vm_area_page_order(area, page_order);
> > > +	page_shift = page_order + PAGE_SHIFT;
> > > +
> > >  	/*
> > >  	 * page tables allocations ignore external gfp mask, enforce it
> > >  	 * by the scope API
> > >
> > The logic of this patch is somewhat similar to my first one. If the
> > high-order allocation fails, it falls back to normal mapping.
> >
> > However, I also save the fallback position: the pages before this position
> > are used for huge mapping, and the ones >= the position for normal mapping,
> > as Barry said: "support the combination of PMD and PTE mapping". This will
> > take some time, as it needs to address the corner cases and do some tests.
>
> Hmm, we may not need to worry about the imperfect mapping. Currently
> there are two places setting VM_ALLOW_HUGE_VMAP: __kvmalloc_node_noprof()
> and vmalloc_huge().
>
> For vmalloc_huge(), it's called in the three call sites below, which are
> all invoked during boot. Basically they can succeed in getting the required
> contiguous physical memory. I guess that's why Tangquan only spotted this
> issue on the kvmalloc invocation when the required size exceeds e.g. 2M.
> For kvmalloc_node(), we have stated in the code comment above
> __kvmalloc_node_noprof() that it's a best-effort behaviour.
>
> mm/mm_init.c <>
> table = vmalloc_huge(size, gfp_flags);
> net/ipv4/inet_hashtables.c <>
> new_hashinfo->ehash = vmalloc_huge(ehash_entries * sizeof(struct inet_ehash_bucket),
> net/ipv4/udp.c <>
> udptable->hash = vmalloc_huge(hash_entries * 2 * sizeof(struct udp_hslot)
>
> Maybe we should add a code comment or document to notify people that
> contiguous physical pages are not guaranteed for vmalloc_huge() if it is
> used after boot.

Currently, the issue goes beyond just 'contiguous physical pages are not
guaranteed.' The problem also includes the likelihood of failure when trying
to allocate 2MB of contiguous memory. That's why I suggest we allow fallback
to order-0 for non-nofail allocations as well, on top of your proposed
changes.
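Roughly, the combined behaviour I have in mind is the following. This is a
simplified sketch only, not the actual vm_area_alloc_pages() code: it assumes
nr_pages is a multiple of 1 << order (which the vmalloc side guarantees for
huge mappings) and it omits the bulk-allocation and memalloc-scope details:

/*
 * Sketch: try the huge order first, fall back to order-0 on failure,
 * and let only the order-0 attempts carry __GFP_NOFAIL, since
 * high-order nofail allocations are not supported.
 */
static unsigned int alloc_pages_with_fallback(gfp_t gfp, int nid,
		unsigned int *page_order, unsigned int nr_pages,
		struct page **pages)
{
	unsigned int order = *page_order;
	unsigned int nr_allocated = 0;

	while (nr_allocated < nr_pages) {
		/* high-order attempts must not carry __GFP_NOFAIL */
		gfp_t alloc_gfp = order ? (gfp & ~__GFP_NOFAIL) : gfp;
		struct page *page = alloc_pages_node(nid, alloc_gfp, order);
		unsigned int i;

		if (!page) {
			if (order) {
				order = 0;	/* fall back to single pages */
				continue;
			}
			/*
			 * order-0 with __GFP_NOFAIL never returns NULL,
			 * so we only get here for non-nofail requests:
			 * give up and let the caller unwind.
			 */
			break;
		}

		/* hand back each constituent base page */
		if (order)
			split_page(page, order);
		for (i = 0; i < (1U << order); i++)
			pages[nr_allocated + i] = page + i;

		nr_allocated += 1U << order;
	}

	*page_order = order;	/* report the order actually allocated */
	return nr_allocated;
}

The point is that *page_order reports the order that was actually allocated,
so __vmalloc_area_node() can derive a page_shift that matches reality rather
than the order it originally asked for.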
The only difference is that for non-nofail allocations, if we fall back to
order-0 and still fail, the allocation fails outright. In the nofail case,
the final order-0 allocation always succeeds.

> >
> > IMO, the draft can fix the current issue, and it does not have significant
> > side effects. Barry, what do you think about this patch? If you think it's
> > okay, I will split this patch into two: one to remove VM_ALLOW_HUGE_VMAP
> > and the other to address the current mapping issue.
> >
> > --
> > help you, help me,
> > Hailong.
>

Thanks
Barry