From: Alexander Duyck <alexander.duyck@gmail.com>
Date: Tue, 30 Apr 2024 07:54:45 -0700
Subject: Re: [PATCH net-next v2 09/15] mm: page_frag: reuse MSB of 'size' field for pfmemalloc
To: Yunsheng Lin
Cc: davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, Andrew Morton, linux-mm@kvack.org

On Tue, Apr 30, 2024 at 5:06 AM Yunsheng Lin wrote:
>
> On 2024/4/29 22:49, Alexander Duyck wrote:
>
> ...
>
> >>>
> >>
> >> After considering a few different layouts for 'struct page_frag_cache',
> >> it seems the below is more optimized:
> >>
> >> struct page_frag_cache {
> >>         /* page address & pfmemalloc & order */
> >>         void *va;
> >
> > I see. So basically just pack the much smaller bitfields in here.
> >
> >>
> >> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
> >>         u16 pagecnt_bias;
> >>         u16 size;
> >> #else
> >>         u32 pagecnt_bias;
> >>         u32 size;
> >> #endif
> >> }
> >>
> >> The lower bits of 'va' are or'ed with the page order & pfmemalloc instead
> >> of with offset or pagecnt_bias, so that we don't have to add more checking
> >> to handle not having enough space for offset or pagecnt_bias when
> >> PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE on a 32-bit system. And since the
> >> page address & pfmemalloc & order are unchanged for the same page in the
> >> same 'page_frag_cache' instance, it makes sense to fit them together.
> >>
> >> Also, it seems better to replace 'offset' with 'size', which indicates
> >> the remaining size of the cache in a 'page_frag_cache' instance, as we
> >> might then be able to do a single 'size >= fragsz' check for the case of
> >> the cache being sufficient, which should be the fast path if we ensure
> >> size is zero when 'va' == NULL.
> >
> > I'm not sure the rename to size is called for as it is going to be
> > confusing. Maybe something like "remaining"?
>
> Yes, using 'size' for that is a bit confusing.
> Besides the above 'remaining', by googling, it seems we may have the below
> options too: 'residual', 'unused', 'surplus'.
>
> >
> >> Something like below:
> >>
> >> #define PAGE_FRAG_CACHE_ORDER_MASK         GENMASK(1, 0)
> >> #define PAGE_FRAG_CACHE_PFMEMALLOC_BIT     BIT(2)
> >
> > The only downside is that it is ossifying things so that we can only
>
> There are 12 bits that are always usable; we can always extend ORDER_MASK
> to more bits if a larger order number is needed.
>
> > ever do order 3 as the maximum cache size. It might be better to do a
> > full 8 bits as on x86 it would just mean accessing the low end of a
> > 16b register. Then you can have pfmemalloc at bit 8.
>
> I am not sure I understand the above, as it seems we have the below options:
> 1. Use something like the above ORDER_MASK and PFMEMALLOC_BIT.
> 2. Use a bitfield, something like below:
>
>        unsigned long va:20;  /* or 52 on a 64-bit system */
>        unsigned long pfmemalloc:1;
>        unsigned long order:11;
>
> Or is there a better idea in your mind?

All I was suggesting was to make the ORDER_MASK 8 bits. Doing that the
compiler can just store VA in a register such as RCX, and instead of
having to bother with a mask it could then just use CL to access the
order.

As far as the flag bits such as pfmemalloc go, the 4 bits starting at 8
would make the most sense, since they don't naturally align to anything
and would have to be masked anyway.

Using a bitfield like you suggest would be problematic, as the compiler
would assume a shift is needed, so you would have to add a shift to your
code to offset it for accessing VA.
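Roughly what I have in mind, as an untested sketch; the macro and helper
names here are just placeholders, not anything that exists today:

/* Untested sketch. Assumes PAGE_SIZE >= 4K so the low 12 bits of the
 * page address are zero and free to borrow for order + flag bits.
 */
#define PAGE_FRAG_CACHE_ORDER_MASK      0xffUL          /* bits 0-7 */
#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT  BIT(8)          /* first flag bit */

static inline void *page_frag_cache_encode_va(struct page *page,
                                              unsigned int order)
{
        /* fold order and pfmemalloc into the unused low bits of va */
        return (void *)((unsigned long)page_address(page) | order |
                        (page_is_pfmemalloc(page) ?
                         PAGE_FRAG_CACHE_PFMEMALLOC_BIT : 0));
}

static inline unsigned int page_frag_cache_order(void *encoded_va)
{
        /* with an 8-bit mask the order is just the low byte of va, so
         * on x86-64 the compiler can read it as the byte subregister
         * (e.g. CL when va sits in RCX) with no shift needed
         */
        return (unsigned long)encoded_va & PAGE_FRAG_CACHE_ORDER_MASK;
}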
> >
> >> struct page_frag_cache {
> >>         /* page address & pfmemalloc & order */
> >>         void *va;
> >>
> >
> > When you start combining things like this we normally would convert va
> > to an unsigned long.
>
> Ack.
>
> >
> >> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
> >>         u16 pagecnt_bias;
> >>         u16 size;
> >> #else
> >>         u32 pagecnt_bias;
> >>         u32 size;
> >> #endif
> >> };
> >>
> >>
> >> static void *__page_frag_cache_refill(struct page_frag_cache *nc,
> >>                                       unsigned int fragsz, gfp_t gfp_mask,
> >>                                       unsigned int align_mask)
> >> {
> >>         gfp_t gfp = gfp_mask;
> >>         struct page *page;
> >>         void *va;
> >>
> >> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> >>         /* Ensure free_unref_page() can be used to free the page fragment */
> >>         BUILD_BUG_ON(PAGE_FRAG_CACHE_MAX_ORDER > PAGE_ALLOC_COSTLY_ORDER);
> >>
> >>         gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
> >>                    __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >>         page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> >>                                 PAGE_FRAG_CACHE_MAX_ORDER);
> >>         if (likely(page)) {
> >>                 nc->size = PAGE_FRAG_CACHE_MAX_SIZE - fragsz;
> >
> > I wouldn't pass fragsz here. Ideally we keep this from having to get
> > pulled directly into the allocator and can instead treat this as a
> > pristine page. We can do the subtraction further down in the actual
> > frag alloc call.
>
> Yes, from the maintainability point of view.
> But from the performance point of view, doesn't it make sense to do the
> subtraction here, as doing the subtraction in the actual frag alloc
> call may involve more load/store operations?

It just means more code paths doing different things. It doesn't add
much here, since what you are doing is juggling more variables in this
function as a result.
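To illustrate, a rough, untested fragment; 'remaining' is the rename
suggested above, and the two-argument refill helper is hypothetical:

        /* In the refill path the page stays pristine; just record the
         * full cache size:
         */
        nc->remaining = PAGE_FRAG_CACHE_MAX_SIZE;

        /* ...and the actual frag alloc call does the accounting once,
         * after alignment:
         */
        remaining = nc->remaining & align_mask;
        if (unlikely(remaining < fragsz)) {
                if (!__page_frag_cache_refill(nc, gfp_mask))
                        return NULL;
                remaining = nc->remaining;
        }
        nc->remaining = remaining - fragsz;

That way there is a single point where fragsz is subtracted, instead of
one subtraction per allocation branch inside the refill function.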
> >
> >>                 va = page_address(page);
> >>                 nc->va = (void *)((unsigned long)va |
> >>                                   PAGE_FRAG_CACHE_MAX_ORDER |
> >>                                   (page_is_pfmemalloc(page) ?
> >>                                    PAGE_FRAG_CACHE_PFMEMALLOC_BIT : 0));
> >>                 page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
> >>                 nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE;
> >>                 return va;
> >>         }
> >> #endif
> >>         page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
> >>         if (likely(page)) {
> >>                 nc->size = PAGE_SIZE - fragsz;
> >>                 va = page_address(page);
> >>                 nc->va = (void *)((unsigned long)va |
> >>                                   (page_is_pfmemalloc(page) ?
> >>                                    PAGE_FRAG_CACHE_PFMEMALLOC_BIT : 0));
> >>                 page_ref_add(page, PAGE_FRAG_CACHE_MAX_SIZE);
> >>                 nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE;
> >>                 return va;
> >>         }
> >>
> >>         nc->va = NULL;
> >>         nc->size = 0;
> >>         return NULL;
> >> }
> >>
> >> void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
> >>                                  unsigned int fragsz, gfp_t gfp_mask,
> >>                                  unsigned int align_mask)
> >> {
> >> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> >>         unsigned long page_order;
> >> #endif
> >>         unsigned long page_size;
> >>         unsigned long size;
> >>         struct page *page;
> >>         void *va;
> >>
> >>         size = nc->size & align_mask;
> >>         va = nc->va;
> >> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> >>         page_order = (unsigned long)va & PAGE_FRAG_CACHE_ORDER_MASK;
> >>         page_size = PAGE_SIZE << page_order;
> >> #else
> >>         page_size = PAGE_SIZE;
> >> #endif
> >
> > So I notice you got rid of the loops within the function. One of the
> > reasons for structuring it the way it was is to enable better code
> > caching. By unfolding the loops you are increasing the number of
> > instructions that have to be fetched and processed in order to
> > allocate the buffers.
>
> I am not sure I understand what 'the loops' means here, as there is no
> 'while' or 'for' here. I suppose you are referring to the 'goto'?

So there was logic before that would jump to a label back at the start
of the function. It seems like you got rid of that logic and just
flattened everything out. This likely results in some duplication in the
code and an overall increase in the number of instructions that need to
be fetched to allocate the fragment.

As I recall, one of the reasons things were folded up the way they were
was to allow the use of a short jump instruction instead of a longer one.
I suspect we may be losing that with these changes.
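For reference, the shape I am describing looked something like the below
(paraphrased from memory, not the exact upstream code):

        int offset;

        if (unlikely(!nc->va)) {
refill:
                page = __page_frag_cache_refill(nc, gfp_mask);
                if (!page)
                        return NULL;
                /* fresh page: reset pagecnt_bias and offset here */
        }

        offset = nc->offset - fragsz;
        if (unlikely(offset < 0)) {
                /* cache exhausted: rearm the page reference, then take
                 * the short backward jump rather than duplicating the
                 * refill logic inline
                 */
                goto refill;
        }

        nc->offset = offset;

With everything folded into one exit path like that, the hot path stays
compact enough for the backward jump to be a short instruction.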