From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Wed, 10 Jan 2024 08:21:06 -0800
Subject: Re: [PATCH net-next 3/6] mm/page_alloc: use initial zero offset for page_frag_alloc_align()
To: Yunsheng Lin
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org
References: <20240103095650.25769-1-linyunsheng@huawei.com> <20240103095650.25769-4-linyunsheng@huawei.com> <74c9a3a1-5204-f79a-95ff-5c108ec6cf2a@huawei.com>

On Wed, Jan 10, 2024 at 1:45 AM Yunsheng Lin wrote:
>
> On 2024/1/9 23:37, Alexander Duyck wrote:
> > On Tue, Jan 9, 2024 at 3:22 AM Yunsheng Lin wrote:
> >>
> >> On 2024/1/9 0:25, Alexander Duyck wrote:
> >>> On Mon, Jan 8, 2024 at 12:59 AM Yunsheng Lin wrote:
> >>
> >> ...
> >>
> >>>>> 2. By starting at the end and working toward zero we can use built in
> >>>>> functionality of the CPU to only have to check and see if our result
> >>>>> would be signed, rather than having to load two registers with the
> >>>>> values and then compare them, which saves us a few cycles. In addition
> >>>>> it saves us from having to read both the size and the offset for every
> >>>>> page.
> >>>>
> >>>> I suppose the above is ok if we only use the page_frag_alloc*() API to
> >>>> allocate memory for skb->data, not for the frags in skb_shinfo(), as by
> >>>> starting at the end and working toward zero we cannot do skb
> >>>> coalescing.
> >>>>
> >>>> As page_frag_alloc*() returns a va now, I am assuming most users
> >>>> use the API for skb->data, so I guess it is ok to drop this patch for
> >>>> now.
> >>>>
> >>>> If we allow page_frag_alloc*() to return a struct page, we might need
> >>>> this patch to enable coalescing.
> >>>
> >>> I would argue this is not the interface for enabling coalescing. This
> >>> is one of the reasons why this is implemented the way it is. When you
> >>> are aligning fragments you aren't going to be able to coalesce the
> >>> frames anyway, as the alignment would push the fragments apart.
> >>
> >> It seems the alignment requirement is the same for a given user of a
> >> page_frag instance, so the aligning does not seem to be a problem for
> >> coalescing?
> >
> > I'm a bit confused as to what coalescing you are referring to. If you
> > can provide a link it would be useful.
> >
> > The problem is page_frag is a very generic item and can be generated
> > from a regular page on NICs that can internally reuse the same page
> > instance for multiple buffers. So it is possible to coalesce page
> > frags; however, it is very unlikely we would be coalescing them when
> > they are used for skb buffers, since that would require aligned
> > payloads on the network to really make it work without hardware
> > intervention of some sort, and such devices are likely allocating
> > entire pages instead of page frags for the buffers.
>
> The main use case in my mind is the page_frag used in the Tx path for
> networking, if we are able to unify page_frag and page_frag_cache in
> the future:
> https://elixir.bootlin.com/linux/v6.7-rc8/source/net/ipv4/tcp.c#L1183
>
> Do you think it makes sense to unify them using the unified struct
> below, and to provide APIs for returning both 'page' and 'va', as
> page_pool does now?

Short answer is no. The difference between the two is the use case,
and combining page and va in the same struct just ends up generating
indirect data duplication. So one step would be to look at what we
could do to either convert page to va or va to page without taking a
significant penalty in either the page_frag or page_frag_cache use
case.

I had opted for using the virtual address, as the Rx path has a
strong need to access the memory as soon as it is written in order
to begin basic parsing tasks and the like. In addition, it is usually
cheaper to go from a virtual address to a page than the other way
around.
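[For reference, a minimal sketch of the "start at the end and work
toward zero" trick quoted above. The names are illustrative and this
is not the actual mm/page_alloc.c code; it only shows how one signed
test replaces a two-register load-and-compare.]

/* Minimal sketch (not the actual mm/page_alloc.c code): counting the
 * offset down toward zero lets a single signed comparison detect page
 * exhaustion, instead of loading and comparing both offset and size.
 */
#include <stddef.h>

struct frag_cache_sketch {
	char *va;	/* base virtual address of the backing page */
	long offset;	/* remaining bytes; counts down toward zero */
};

static void *frag_alloc_sketch(struct frag_cache_sketch *nc,
			       unsigned long fragsz, unsigned long align)
{
	long offset = nc->offset - (long)fragsz;

	/* Round down to the alignment; align must be a power of two. */
	offset &= ~((long)align - 1);

	/* One signed check covers both exhaustion and underflow. */
	if (offset < 0)
		return NULL;	/* caller refills with a fresh page */

	nc->offset = offset;
	return nc->va + offset;
}

[A caller would initialize offset to the page size and simply reset
it whenever the backing page is replaced.]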
As for the rest of the fields, we have essentially two main use
cases. The first is the Rx path, which usually implies DMA, not
knowing what size the incoming frame will be, and needing allocation
to succeed to avoid jamming up a device. So it is basically always
doing something like allocating 2K even though it may only receive
somewhere between 60B and 1514B, and it is always allocating from
reserves.

For something like Tx, keeping the pagecnt_bias and pfmemalloc values
doesn't make much sense, as neither is really needed for the Tx use
case. Instead callers can just call get_page() as they slice off
pieces of the page, and they have the option of failing if they
cannot get enough memory to put the skb together.

> It may mean we need to add one more pointer to the new struct and are
> not able to do some tricks for performance, but I suppose that is ok,
> as there is always some trade-off for maintainability and
> evolvability?
>
> struct page_frag {
>         struct page *page;
>         void *va;
> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>         __u16 offset;
>         __u16 size;
> #else
>         __u32 offset;
> #endif
>         /* we maintain a pagecount bias, so that we don't dirty the
>          * cache line containing page->_refcount every time we
>          * allocate a fragment.
>          */
>         unsigned int pagecnt_bias;
>         bool pfmemalloc;
> };

My general thought was to instead just make the page_frag a member of
the page_frag_cache, since those two blocks would be the same. Then
we could see how we evolve things from there. By doing that there
should essentially be no code change.

> Another use case that is not really related: hw may be configured
> with a small BD buf size, e.g. 2K, while being configured with a big
> MTU or having hw GRO enabled on a 4K page size. That means we may be
> able to cut the number of frags in half, as it is usually the case
> that two consecutive BDs point to the same page. I implemented a POC
> in hns3 a long time ago using the frag implementation in page_pool,
> and it did show some obvious performance gain, but as priorities
> shifted I have not been able to continue that POC yet.

The main issue for that use case is that in order to make it work you
usually have to copybreak the headers out of the page frags, as you
won't be able to reserve the space for the skb tailroom. Either that
or you are using header/data split, in which case the data portion
should really be written to full pages instead of page fragments
anyway, making use of page_pool instead.
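[To make the member-embedding idea above concrete, a hypothetical
layout might look like the following. The field names mirror the
quoted proposal but are illustrative, not the actual kernel
definitions.]

/* Hypothetical sketch of embedding a page_frag inside page_frag_cache
 * so the two share one definition; illustrative names only.
 */
#include <linux/types.h>

struct page_frag_sketch {
	struct page *page;
	__u32 offset;
	__u32 size;
};

struct page_frag_cache_sketch {
	/* state common to both the page_frag and cache use cases */
	struct page_frag_sketch frag;
	/* cache-only state: the bias avoids dirtying the cache line
	 * holding page->_refcount on every fragment allocation
	 */
	unsigned int pagecnt_bias;
	bool pfmemalloc;
};

[Existing page_frag users would keep operating on the embedded
member, which is why, as noted above, essentially no code change
would be needed on that side.]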
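[And a rough sketch of the Tx-side slicing described earlier, where
get_page() takes one reference per fragment and the caller is allowed
to fail. This is an assumption-laden illustration, not existing
kernel code; only get_page() and page_address() are real kernel APIs.]

#include <linux/mm.h>	/* get_page(), page_address() */

/* Hypothetical Tx-side slicing: no pagecnt_bias or pfmemalloc
 * tracking; take one page reference per fragment and let the caller
 * fail gracefully when the page is exhausted.
 */
static void *tx_slice_sketch(struct page *page, unsigned int *remaining,
			     unsigned int fragsz)
{
	if (*remaining < fragsz)
		return NULL;	/* caller may grab a new page or drop the skb */

	*remaining -= fragsz;
	get_page(page);		/* one reference per outstanding fragment */
	return page_address(page) + *remaining;
}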