From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Sat, 13 Jul 2024 09:55:56 -0700
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
To: Yunsheng Lin
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org
In-Reply-To: <29e8ac53-f7da-4896-8121-2abc25ec2c95@gmail.com>
On Fri, Jul 12, 2024 at 10:20 PM Yunsheng Lin wrote:
>
> On 7/13/2024 12:55 AM, Alexander Duyck wrote:
>
> ...
>
> >>
> >> So it is about ensuring the additional room due to alignment requirement
> >> being placed at the end of a fragment, in order to avoid false sharing
> >> between the last fragment and the current fragment as much as possible,
> >> right?
> >>
> >> I generally agree with the above if we can also ensure skb coalescing by
> >> doing offset count-up instead of offset count-down.
> >>
> >> If there is a conflict between them, I am assuming that enabling skb frag
> >> coalescing is preferred over avoiding false sharing, right?
> >
> > The question I would have is where do we have opportunities for this
> > to result in coalescing? If we are using this to allocate skb->data
> > then there isn't such a chance as the tailroom gets in the way.
> >
> > If this is for a device allocating for an Rx buffer we won't get that
> > chance as we have to preallocate some fixed size not knowing the
> > buffer that is coming, and it usually needs to be DMA aligned in order
> > to avoid causing partial cacheline reads/writes. The only way these
> > would combine well is if you are doing aligned fixed blocks and are
> > the only consumer of the page frag cache. It is essentially just
> > optimizing for jumbo frames in that case.
>
> And hw-gro or sw-gro.

No and no. The problem is that for hw-gro, in many cases the device
will be DMA aligning the start of writes for each new frame that comes
in. As such, it is possible you won't be able to make use of this
unless the device is either queueing up the entire packet before
writing it to memory, or taking a double penalty for partial cacheline
writes, which will negatively impact performance.

For sw-gro it won't happen since, as I mentioned, the device is doing
DMA-aligned writes, usually with fixed-size buffers that will be
partially empty. As such, coalescing will likely not be possible in
either of those scenarios. This is why I was so comfortable using the
reverse ordering for the allocations.
Trying to aggregate frames in this way will be very difficult, and the
likelihood of anyone ever needing to do it is incredibly small.

> >
> > If this is for some software interface, why wouldn't it request the
> > coalesced size and do one allocation rather than trying to figure out
> > how to perform a bunch of smaller allocations and then trying to
> > merge them together after the fact?
>
> I am not sure I understand what 'some software interface' is referring
> to; I hope you are not suggesting the below optimizations utilizing the
> skb_can_coalesce() checking are unnecessary :(
>
> https://elixir.bootlin.com/linux/v6.10-rc7/C/ident/skb_can_coalesce
>
> Most of the use cases do that for the reason below, as mentioned in
> Documentation/mm/page_frags.rst, as my understanding:
> "There is also a use case that needs minimum memory in order for forward
> progress, but more performant if more memory is available."

No, what I am talking about is that it will be expensive to use and
will have very little benefit to coalesce frames coming from a NIC, as
I called out above. Most NICs won't use page frags anyway; they will be
using page pool, which is a slightly different beast.

> >
> >>>
> >>>> The above is why I added the below paragraph in the doc to make the semantics
> >>>> more explicit:
> >>>> "Depending on different aligning requirement, the page_frag API caller may call
> >>>> page_frag_alloc*_align*() to ensure the returned virtual address or offset of
> >>>> the page is aligned according to the 'align/alignment' parameter. Note the size
> >>>> of the allocated fragment is not aligned, the caller needs to provide an aligned
> >>>> fragsz if there is an alignment requirement for the size of the fragment."
> >>>>
> >>>> And existing callers of the page_frag aligned API do seem to follow the above
> >>>> rule, last time I checked.
> >>>>
> >>>> Or did I miss something obvious here?
> >>>
> >>> No, you didn't miss anything. It is just that there is now more
> >>> potential for error than there was before.
> >>
> >> I guess the 'error' referred to is the 'false sharing' mentioned above,
> >> right? If it is indeed an error, are we not supposed to fix it instead
> >> of allowing such implicit implication? Allowing implicit implication
> >> seems to make the 'error' harder to reproduce and debug.
> >
> > The concern with the code as it stands is that, if I am not mistaken,
> > remaining isn't actually aligned. You aligned it, then added fragsz.
> > That becomes the start of the next frame, so if fragsz isn't aligned
> > the next requester will be getting an unaligned buffer, or one that
> > is only aligned to the previous caller's alignment.
>
> As mentioned below:
> https://lore.kernel.org/all/3da33d4c-a70e-23a4-8e00-23fe96dd0c1a@huawei.com/
>
> what alignment semantics are we providing here:
> 1. Ensure alignment for both offset and fragsz.
> 2. Ensure alignment for offset only.
> 3. Ensure alignment for fragsz only.
>
> As my understanding, the original code before this patchset was only
> ensuring alignment for the offset too. So there may be 'false sharing'
> both before this patchset and after this patchset. It would be better
> not to argue about which implementation has more/less potential to
> avoid the 'false sharing'; it is undefined behavior, and the argument
> would be endless depending on use case and personal preference.
>
> As I said before, I would love to retain the old undefined behavior
> when there is a reasonable way to support the new use cases.

My main concern is that you were aligning "remaining", then adding
fragsz, and storing that. That was then used as the offset for the next
buffer. That isn't aligned, since fragsz and the previous offset aren't
guaranteed to align with the new allocation. So at a minimum the
existing code cannot be using nc->remaining when generating the pointer
to the page.
Instead it has to be using the aligned version of that value that you
generated before adding fragsz and verifying there is space.

> >
> >>>
> >>>>> need to align the remaining, then add fragsz, and then I guess you
> >>>>> could store remaining and then subtract fragsz from your final virtual
> >>>>> address to get back to where the starting offset is actually located.
> >>>>
> >>>> remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);
> >>>> remaining += fragsz;
> >>>> nc->remaining = remaining;
> >>>> return encoded_page_address(nc->encoded_va) + (size + remaining) - fragsz;
> >>>>
> >>>> If yes, I am not sure what is the point of doing the above yet; it
> >>>> just seems to make things more complicated and harder to understand.
> >>>
> >>> That isn't right. I am not sure why you are adding size + remaining or
> >>> what those are supposed to represent.
> >>
> >> As the above assumes 'remaining' is a negative value as you suggested,
> >> (size + remaining) is supposed to represent the offset of the next fragment
> >> to ensure we have a count-up offset for enabling skb frag coalescing, and
> >> '- fragsz' is used to get the offset of the current fragment.
> >>
> >>>
> >>> The issue was that the "remaining" ends up being an unaligned value, as
> >>> you were starting by aligning it and then adding fragsz. So by
> >>> subtracting fragsz you can get back to the aligned start. What this
> >>> patch was doing before was adding the raw unaligned nc->remaining at
> >>> the end of the function.
> >>>
> >>>>>
> >>>>> Basically your "remaining" value isn't a safe number to use for an
> >>>>> offset since it isn't aligned to your starting value at any point.
> >>>>
> >>>> Does using an 'aligned_remaining' local variable to make it more obvious
> >>>> seem reasonable to you?
> >>>
> >>> No, as the value you are storing above isn't guaranteed to be aligned.
> >>> If you stored it before adding fragsz then it would be aligned.
> >>
> >> I have a feeling that what you are proposing may be in conflict with
> >> enabling skb frag coalescing, as doing offset count-up seems to need some
> >> room to be reserved at the beginning of an allocated fragment due to the
> >> alignment requirement, and that may mean we need to do both fragsz and
> >> offset aligning.
> >>
> >> Perhaps the 'remaining' changing in this patch does seem to make things
> >> harder to discuss. Anyway, it would be more helpful if there is some pseudo
> >> code to show the steps of how the above can be done in your mind.
> >
> > Basically what you would really need to do for all this is:
> > remaining = __ALIGN_KERNEL_MASK(nc->remaining, ~align_mask);
> > nc->remaining = remaining + fragsz;
> > return encoded_page_address(nc->encoded_va) + size + remaining;

I might have mixed my explanation up a bit. This is assuming remaining
is a negative value, as I mentioned before.

Basically, the issue is that your current code is using nc->remaining
to generate the current address, and that is bad, as it isn't aligned
to anything: fragsz was added to it and no alignment check had been
done on that value.