From: Yunsheng Lin <linyunsheng@huawei.com>
To: Alexander H Duyck <alexander.duyck@gmail.com>,
	<davem@davemloft.net>, <kuba@kernel.org>, <pabeni@redhat.com>
Cc: <netdev@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>, <linux-mm@kvack.org>
Subject: Re: [PATCH net-next v9 06/13] mm: page_frag: reuse existing space for 'size' and 'pfmemalloc'
Date: Wed, 3 Jul 2024 20:33:27 +0800	[thread overview]
Message-ID: <f4ff5a42-9371-bc54-8523-b11d8511c39a@huawei.com> (raw)
In-Reply-To: <a026c32218cabc7b6dc579ced1306aefd7029b10.camel@gmail.com>

On 2024/7/2 22:55, Alexander H Duyck wrote:

...

> 
>>>
>>>> +#define PAGE_FRAG_CACHE_ORDER_MASK		GENMASK(7, 0)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_BIT		BIT(8)
>>>> +#define PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT	8
>>>> +
>>>> +static inline struct encoded_va *encode_aligned_va(void *va,
>>>> +						   unsigned int order,
>>>> +						   bool pfmemalloc)
>>>> +{
>>>>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> -	__u16 offset;
>>>> -	__u16 size;
>>>> +	return (struct encoded_va *)((unsigned long)va | order |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>>  #else
>>>> -	__u32 offset;
>>>> +	return (struct encoded_va *)((unsigned long)va |
>>>> +			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
>>>> +#endif
>>>> +}
>>>> +
> 
> This is missing any and all protection of the example you cited. If I
> want to pass order as a 32b value I can, and I can corrupt the virtual
> address. Same thing with pfmemalloc. Let's only hope you don't hit an
> architecture where a bool is a u8 in size, as otherwise that shift is
> going to wipe out the value, and if it is a u32, which is usually the
> case, let's hope they stick to the values of 0 and 1.

I did check this explicitly; the masking/clamping protection is
intentionally omitted for performance reasons, and it is not really
needed because:
1. For the 'pfmemalloc' part, a bool does always stick to the values of
   0 and 1, as documented in the kernel coding style:
https://elixir.bootlin.com/linux/v6.10-rc6/source/Documentation/process/coding-style.rst#L1053

2. For the 'order' part, its range can only be 0~3.

> 
>>>> +static inline unsigned long encoded_page_order(struct encoded_va *encoded_va)
>>>> +{
>>>> +#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
>>>> +	return PAGE_FRAG_CACHE_ORDER_MASK & (unsigned long)encoded_va;
>>>> +#else
>>>> +	return 0;
>>>> +#endif
>>>> +}
>>>> +
>>>> +static inline bool encoded_page_pfmemalloc(struct encoded_va *encoded_va)
>>>> +{
>>>> +	return PAGE_FRAG_CACHE_PFMEMALLOC_BIT & (unsigned long)encoded_va;
>>>> +}
>>>> +
>>>
>>> My advice is that if you just make encoded_va an unsigned long this
>>> just becomes some FIELD_GET and bit operations.
>>
>> As above.
>>
> 
> The code you mentioned had one extra block of bits in it and had
> strict protections on what went into and out of those bits. You don't
> have any of those protections.

As above, the masking/checking protection is intentionally left out for
encoded_va due to performance considerations and the reasons given
above.

But I guess it doesn't hurt to add a VM_BUG_ON() check to catch
possible future mistakes.
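
Perhaps something like the below sketch (only the
PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE case, untested; using
PAGE_ALIGNED()/VM_BUG_ON() here is just my rough idea of what the check
could look like, not something taken from the patch):

static inline struct encoded_va *encode_aligned_va(void *va,
						   unsigned int order,
						   bool pfmemalloc)
{
	/*
	 * 'va' must be page-aligned so that its low bits are free to
	 * hold 'order' and 'pfmemalloc'; catch callers breaking that,
	 * and catch an 'order' that does not fit in the low byte.
	 */
	VM_BUG_ON(!PAGE_ALIGNED(va));
	VM_BUG_ON(order & ~PAGE_FRAG_CACHE_ORDER_MASK);

	return (struct encoded_va *)((unsigned long)va | order |
			pfmemalloc << PAGE_FRAG_CACHE_PFMEMALLOC_SHIFT);
}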

> 
> I suggest you just use a long and don't bother storing this as a
> pointer.
> 

...

>>>> -
>>>> +	remaining = nc->remaining & align_mask;
>>>> +	remaining -= fragsz;
>>>> +	if (unlikely(remaining < 0)) {
>>>
>>> Now this is just getting confusing. You essentially just added an
>>> additional addition step and went back to the countdown approach I was
>>> using before except for the fact that you are starting at 0 whereas I
>>> was actually moving down through the page.
>>
>> Does the 'additional addition step' mean the additional step of
>> calculating the offset from the new 'remaining' field? I guess that
>> is the disadvantage of changing 'offset' to 'remaining', but it also
>> has some advantages:
>>
>> 1. By replacing 'offset' with 'remaining', which is the remaining
>>    size of the cache in a 'page_frag_cache' instance, we are able to
>>    do a single 'fragsz > remaining' check for the case of the cache
>>    not being big enough, which should be the fast path if we ensure
>>    the size is zero when 'va' == NULL by memset'ing
>>    'struct page_frag_cache' in page_frag_cache_init() and
>>    page_frag_cache_drain().
>> 2. It seems more convenient to implement the commit/probe API too
>>    when using 'remaining' instead of 'offset', as those APIs need the
>>    remaining size of the page_frag_cache anyway.
>>
>> So it is really a trade-off between using 'offset' and 'remaining',
>> similar to the argument about the trade-off between allocating
>> fragments the 'countdown' way and the 'countup' way.
>>
>> About the confusing part: nc->remaining does mean how much cache is
>> left in 'nc', and it does naturally count down from
>> PAGE_FRAG_CACHE_MAX_SIZE/PAGE_SIZE to zero if that is what you meant
>> by 'countdown', but as I understand it that is different from the
>> 'offset countdown' before this patchset.
>>
>>>
>>> What I would suggest doing since "remaining" is a negative offset
>>> anyway would be to look at just storing it as a signed negative number.
>>> At least with that you can keep to your original approach and would
>>> only have to change your check to be for "remaining + fragsz <= 0".
>>
>> Did you mean using s16/s32 for 'remaining', and setting nc->remaining
>> like below?
>> nc->remaining = -PAGE_SIZE or
>> nc->remaining = -PAGE_FRAG_CACHE_MAX_SIZE
> 
> Yes. Basically if we used it as a signed value then you could just work
> your way up and do your aligned math still.

For the aligned math, I am not sure how 'align_mask' can be applied to
a signed value yet. It seems that after masking, nc->remaining leans
towards -PAGE_SIZE/-PAGE_FRAG_CACHE_MAX_SIZE instead of towards zero as
it would for an unsigned value, for example:

nc->remaining = -4094;
nc->remaining &= -64;

It seems we get -4096 above; is that what we are expecting?
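
FWIW, a stand-alone user-space sketch of the arithmetic above, just to
double-check the two's-complement behaviour (nothing kernel-specific
here):

#include <stdio.h>

int main(void)
{
	int remaining = -4094;	/* 4094 bytes left, stored as a negative count */
	int align_mask = -64;	/* ~(64 - 1), i.e. a 64-byte alignment mask */

	remaining &= align_mask;

	/*
	 * Prints -4096: masking rounds the value away from zero, towards
	 * -PAGE_SIZE, rather than towards zero as it does for an unsigned
	 * offset.
	 */
	printf("%d\n", remaining);

	return 0;
}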

> 
>> struct page_frag_cache {
>>         struct encoded_va *encoded_va;
>>
>> #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE) && (BITS_PER_LONG <= 32)
>>         u16 pagecnt_bias;
>>         s16 remaining;
>> #else
>>         u32 pagecnt_bias;
>>         s32 remaining;
>> #endif
>> };
>>
>> If I understand above correctly, it seems we really need a better name
>> than 'remaining' to reflect that.
> 
> It would effectively work like a countdown. Instead of T - X in this
> case it is size - X.
> 
>>> With that you can still do your math but it becomes an addition instead
>>> of a subtraction.
>>
>> And I am not really sure what the gain is of using an addition
>> instead of a subtraction here.
>>
> 
> The "remaining" as it currently stands is doing the same thing so odds
> are it isn't too big a deal. It is just that the original code was
> already somewhat confusing and this is just making it that much more
> complex.
> 
>>>> +		page = virt_to_page(encoded_va);
>>>>  		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
>>>>  			goto refill;
>>>>  
>>>> -		if (unlikely(nc->pfmemalloc)) {
>>>> -			free_unref_page(page, compound_order(page));
>>>> +		if (unlikely(encoded_page_pfmemalloc(encoded_va))) {
>>>> +			VM_BUG_ON(compound_order(page) !=
>>>> +				  encoded_page_order(encoded_va));
>>>> +			free_unref_page(page, encoded_page_order(encoded_va));
>>>>  			goto refill;
>>>>  		}
>>>>  
>>>>  		/* OK, page count is 0, we can safely set it */
>>>>  		set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
>>>>  
>>>> -		/* reset page count bias and offset to start of new frag */
>>>> +		/* reset page count bias and remaining of new frag */
>>>>  		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
>>>> -		offset = 0;
>>>> -		if (unlikely(fragsz > PAGE_SIZE)) {
>>>> +		nc->remaining = remaining = page_frag_cache_page_size(encoded_va);
>>>> +		remaining -= fragsz;
>>>> +		if (unlikely(remaining < 0)) {
>>>>  			/*
>>>>  			 * The caller is trying to allocate a fragment
>>>>  			 * with fragsz > PAGE_SIZE but the cache isn't big
>>>
>>> I find it really amusing that you went to all the trouble of flipping
>>> the logic just to flip it back to being a countdown setup. If you were
>>> going to bother with all that then why not just make the remaining
>>> negative instead? You could save yourself a ton of trouble that way and
>>> all you would need to do is flip a few signs.
>>
>> I am not sure I understand the 'a ton of trouble' part here. If
>> 'flipping a few signs' does save a ton of trouble, I would like to
>> avoid that trouble, but I do not really understand the gain here yet,
>> as mentioned above.
> 
> It isn't about flipping the signs. It is more the fact that the logic
> has now become even more complex than it originally was. With my
> work-backwards approach the advantage was that the offset could be
> updated and then we just recorded everything and reported it up. Now
> we have to

I am supposing the above is referring to the 'offset countdown' way
before this patchset, right?

> keep a temporary "remaining" value, generate our virtual address and
> store that to a temp variable, we can record the new "remaining" value,
> and then we can report the virtual address. If we get the ordering on

Yes, I noticed that when coding too; that is why the current virtual
address is generated in page_frag_cache_current_va() based on
nc->remaining instead of the local variable 'remaining', before the
local variable 'remaining' is assigned to nc->remaining. But I am not
sure I understand how using a signed negative number for 'remaining'
would change the above steps. If it does not, I still fail to see the
gain of using a signed negative number for 'nc->remaining'.

> the steps 2 and 3 in this wrong, it will cause issues as we will be
> pointing to the wrong values. It is something I can see someone easily
> messing up.

Yes, agreed. It would be good to be more specific about how the above
problem can be avoided when using a signed negative number for
'remaining', as I am not sure how that can be done yet.

> 

