From: Alexander Duyck
Date: Wed, 1 Jun 2022 08:04:17 -0700
Subject: Re: Re: [PATCH v2] mm: page_frag: Warn_on when frag_alloc size is bigger than PAGE_SIZE
To: 愚树
Cc: Jakub Kicinski, Andrew Morton, linux-mm, LKML, Netdev
In-Reply-To: <3498989.c69f.1811f41186e.Coremail.chen45464546@163.com>
References: <20220531081412.22db88cc@kernel.org> <1654011382-2453-1-git-send-email-chen45464546@163.com> <20220531084704.480133fa@kernel.org> <3498989.c69f.1811f41186e.Coremail.chen45464546@163.com>

On Wed, Jun 1, 2022 at 5:33 AM 愚树 wrote:
>
> At 2022-06-01 01:28:59, "Alexander Duyck" wrote:
> >On Tue, May 31, 2022 at 8:47 AM Jakub Kicinski wrote:
> >>
> >> On Tue, 31 May 2022 23:36:22 +0800 Chen Lin wrote:
> >> > At 2022-05-31 22:14:12, "Jakub Kicinski" wrote:
> >> > >On Tue, 31 May 2022 22:41:12 +0800 Chen Lin wrote:
> >> > >> The sample code above cannot completely solve the current problem.
> >> > >> For example, when fragsz is greater than PAGE_FRAG_CACHE_MAX_SIZE (32768),
> >> > >> __page_frag_cache_refill will return a memory area of only 32768 bytes, so
> >> > >> should we continue to expand PAGE_FRAG_CACHE_MAX_SIZE? Maybe more
> >> > >> work needs to be done.
> >> > >
> >> > >Right, but I can think of two drivers off the top of my head which will
> >> > >allocate <=32k frags but none which will allocate more.
> >> >
> >> > In fact, it is rare to request more than one page, so is it necessary to
> >> > change the code to support that?
> >>
> >> I don't really care if it's supported TBH, but I dislike adding
> >> a branch to the fast path just to catch one or two esoteric bad
> >> callers.
> >>
> >> Maybe you can wrap the check with some debug CONFIG_ so it won't
> >> run on production builds?
> >
> >Also, the example used here to define what is triggering the behavior
> >is seriously flawed. The code itself is meant to allow for order-0 page
> >reuse, and the 32K page was just an optimization. So the assumption
> >that you could request more than 4k is a bad assumption in the driver
> >that is making this call.
> >
> >So I am in agreement with Kuba. We shouldn't need to add code in
> >the fast path to tell users not to shoot themselves in the foot.
> >
> >We already have code in place in __netdev_alloc_skb that calls
> >the slab allocator if "len > SKB_WITH_OVERHEAD(PAGE_SIZE)". We could
> >probably just add a DEBUG-wrapped BUG_ON to capture those cases where
> >a driver is making that mistake with __netdev_alloc_frag_align.
>
> Thanks for the clear explanation.
> The reality is that it is not easy to catch the drivers that make this mistake,
> because the resulting memory corruption usually shows up as errors in other,
> unrelated modules. Not long ago we spent a lot of time and effort locating an
> issue that occasionally occurred in different kernel modules, and finally found
> that the root cause was improper use of the netdev_alloc_frag interface in the
> DPAA net driver from NXP. It was a miserable process.
>
> I also found that some net drivers in the latest Linux version have this issue.
> Like:
> 1. netdev_alloc_frag "len" may be larger than PAGE_SIZE:
>
> #elif (PAGE_SIZE >= E1000_RXBUFFER_4096)
>         adapter->rx_buffer_len = PAGE_SIZE;
> #endif
>
> static unsigned int e1000_frag_len(const struct e1000_adapter *a)
> {
>         return SKB_DATA_ALIGN(a->rx_buffer_len + E1000_HEADROOM) +
>                 SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> }
>
> static void *e1000_alloc_frag(const struct e1000_adapter *a)
> {
>         unsigned int len = e1000_frag_len(a);
>         u8 *data = netdev_alloc_frag(len);
> }
> "./drivers/net/ethernet/intel/e1000/e1000_main.c" line 5316

So there isn't actually a bug in this code. Specifically, the code is
split between two paths. The first code block comes from the jumbo
frames path, which creates a fraglist skb and will memcpy the header
out if I recall correctly. The code from the other two functions is
from the non-jumbo frames path, which has restricted the length to
MAXIMUM_ETHERNET_VLAN_SIZE.

> 2. netdev_alloc_frag "ring->frag_size" may be larger than (4096 * 3):
>
> #define MTK_MAX_LRO_RX_LENGTH (4096 * 3)
>
>         if (rx_flag == MTK_RX_FLAGS_HWLRO) {
>                 rx_data_len = MTK_MAX_LRO_RX_LENGTH;
>                 rx_dma_size = MTK_HW_LRO_DMA_SIZE;
>         } else {
>                 rx_data_len = ETH_DATA_LEN;
>                 rx_dma_size = MTK_DMA_SIZE;
>         }
>
>         ring->frag_size = mtk_max_frag_size(rx_data_len);
>
>         for (i = 0; i < rx_dma_size; i++) {
>                 ring->data[i] = netdev_alloc_frag(ring->frag_size);
>                 if (!ring->data[i])
>                         return -ENOMEM;
>         }
> "drivers/net/ethernet/mediatek/mtk_eth_soc.c" line 3344
>
> I will try to fix these drivers later.

This one I don't know as much about, and it does appear to contain a
bug. What it should be doing is a check before the netdev_alloc_frag
call: if the length is less than 4K it can use netdev_alloc_frag, and
if it is greater it needs to use alloc_pages instead (a sketch of that
pattern follows below the quoted patches).

> Even experienced driver engineers may use this netdev_alloc_frag
> interface incorrectly.
> So I think it is best to emit some diagnostic information for this usage
> error inside netdev_alloc_frag, or at least to report the mistake at
> runtime when a varying fragsz ends up exceeding the page size.
>
> Now, as you and Kuba mentioned earlier, "do not add code in the fast path".
>
> Can we just add code to the relatively slow path to capture the mistake
> before it leads to memory corruption?
> Like:
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index e6f211d..ac60a97 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5580,6 +5580,7 @@ void *page_frag_alloc_align(struct page_frag_cache *nc,
>                 /* reset page count bias and offset to start of new frag */
>                 nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>                 offset = size - fragsz;
> +               BUG_ON(offset < 0);
>         }
>
>         nc->pagecnt_bias--;
>

I think I could be onboard with a patch like this. The test shouldn't
add more than one instruction, since it is essentially just a
jump-if-signed test which will be performed after the "size - fragsz"
computation.

> Additionally, we could modify the documentation to clearly indicate the
> limit on the input parameter fragsz.
> Like:
> diff --git a/Documentation/vm/page_frags.rst b/Documentation/vm/page_frags.rst
> index 7d6f938..61b2805 100644
> --- a/Documentation/vm/page_frags.rst
> +++ b/Documentation/vm/page_frags.rst
> @@ -4,7 +4,7 @@
>  Page fragments
>  ==============
>
> -A page fragment is an arbitrary-length arbitrary-offset area of memory
> +A page fragment is an arbitrary-length (must be <= PAGE_SIZE) arbitrary-offset area of memory
>  which resides within a 0 or higher order compound page.
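As an illustration of that check-then-allocate pattern, here is a minimal
sketch; the helper name and GFP flags are assumptions made for the example,
not code from mtk_eth_soc or any in-tree driver:

/*
 * Hypothetical helper, not taken from any driver: lengths that fit in a
 * single page go through the page fragment cache, anything larger falls
 * back to a higher-order page allocation.
 */
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/netdevice.h>

static void *rx_buf_alloc(unsigned int len)
{
	struct page *page;

	if (len <= PAGE_SIZE)
		return netdev_alloc_frag(len);

	/* Too big for the frag cache: allocate whole pages instead. */
	page = alloc_pages(GFP_ATOMIC | __GFP_COMP | __GFP_NOWARN,
			   get_order(len));
	return page ? page_address(page) : NULL;
}

The free path has to match, of course: skb_free_frag() for the first case
and __free_pages() for the second.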
The main thing I would call out about the page fragment is that it
should be less than an order-0 page in size, ideally no more than half
a page, to allow for reuse even in the case of order-0 pages. Otherwise
it is really an abuse of the interface, as it isn't meant to be
allocating one fragment per page; the efficiency will drop pretty
significantly as memory becomes fragmented and it becomes harder to
allocate higher-order pages. It would essentially just become
alloc_page with more overhead.
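To make the failure mode concrete, here is a tiny userspace model of the
offset arithmetic in page_frag_alloc_align() (types simplified to plain
int; this is a demonstration, not the kernel code): once the cache has
fallen back to an order-0 page, size is PAGE_SIZE, so any fragsz larger
than that drives the offset negative and the returned fragment would
start before the backing page, which is exactly the case the proposed
BUG_ON(offset < 0) would catch.

#include <stdio.h>

#define MODEL_PAGE_SIZE 4096	/* order-0 fallback page */

int main(void)
{
	int size = MODEL_PAGE_SIZE;	/* backing page size after the order-0 fallback */
	int fragsz = 4096 * 3;		/* an LRO-sized request, as in the mtk example */
	int offset = size - fragsz;	/* what the refill path computes */

	printf("offset = %d\n", offset);	/* prints -8192 */
	if (offset < 0)
		printf("fragment would start %d bytes before the page; "
		       "BUG_ON(offset < 0) would fire here\n", -offset);
	return 0;
}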