Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme

From: Yunsheng Lin <linyunsheng@huawei.com>
To: Jesper Dangaard Brouer <jbrouer@redhat.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	<netdev@vger.kernel.org>, Eric Dumazet <eric.dumazet@gmail.com>,
	<linux-mm@kvack.org>, Mel Gorman <mgorman@techsingularity.net>
Cc: brouer@redhat.com, lorenzo@kernel.org,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	bpf@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Paolo Abeni" <pabeni@redhat.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	willy@infradead.org
Subject: Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme
Date: Fri, 5 May 2023 08:54:50 +0800
Message-ID: <fb8bbf84-20c2-c398-d972-949e909e2c51@huawei.com>
In-Reply-To: <3785321f-b2f8-d753-7efc-78ee40e6d0b6@redhat.com>

On 2023/5/4 21:48, Jesper Dangaard Brouer wrote:
> On 04/05/2023 04.42, Yunsheng Lin wrote:
>> On 2023/4/29 0:16, Jesper Dangaard Brouer wrote:
>>>   void page_pool_release_page(struct page_pool *pool, struct page *page)
>>>   {
>>> +    unsigned int flags = READ_ONCE(pool->p.flags);
>>>       dma_addr_t dma;
>>> -    int count;
>>> +    u32 release_cnt;
>>> +    u32 hold_cnt;
>>>         if (!(pool->p.flags & PP_FLAG_DMA_MAP))
>>>           /* Always account for inflight pages, even if we didn't
>>> @@ -490,11 +503,15 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>>>   skip_dma_unmap:
>>>       page_pool_clear_pp_info(page);
>>>   -    /* This may be the last page returned, releasing the pool, so
>>> -     * it is not safe to reference pool afterwards.
>>> -     */
>>> -    count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>>> -    trace_page_pool_state_release(pool, page, count);
>>
>> There is a time window between "unsigned int flags = READ_ONCE(pool->p.flags)"
>> and flags checking, if page_pool_destroy() is called concurrently during that
>> time window, it seems we will have a pp instance leaking problem here?
>>
> 
> Nope, that is resolved by the code changes in page_pool_destroy(), see below.

Maybe I did not describe the data race clearly enough.

         CPU 0                                               CPU 1
           .
           .
unsigned int flags = READ_ONCE(pool->p.flags);
           .
           .                                         page_pool_destroy()
           .
atomic_inc_return(&pool->pages_state_release_cnt)
           .
           .
           .
if (flags & PP_FLAG_SHUTDOWN)
        page_pool_free_attempt();

The above data race may leak the pp instance:
CPU0 is releasing the last page of a pp, but its snapshot of pool->p.flags
does not have PP_FLAG_SHUTDOWN set, because page_pool_destroy() is called
after the READ_ONCE(). So CPU0 never calls page_pool_free_attempt() to free
the pp.

CPU1, calling page_pool_destroy(), does not free the pp either, as CPU0 has
not yet done the atomic_inc_return() on pool->pages_state_release_cnt.

Or did I miss something obvious here?
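
To make the interleaving concrete, here is a minimal single-threaded replay
in userspace C. It is only a sketch: struct race_pool, race_free_attempt()
and race_replay() are hypothetical names, TOY_FLAG_SHUTDOWN stands in for
PP_FLAG_SHUTDOWN, and the way page_pool_destroy() drops its fake in-flight
count (bump release_cnt, then attempt the free) is my assumption about the
patch, not a copy of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_FLAG_SHUTDOWN (1u << 0)	/* stand-in for PP_FLAG_SHUTDOWN */

/* Toy model: only the fields the race touches. */
struct race_pool {
	unsigned int flags;
	uint32_t hold_cnt;
	uint32_t release_cnt;
	bool freed;
};

/* Models page_pool_free_attempt(): free once nothing is in flight. */
static void race_free_attempt(struct race_pool *p, uint32_t hold, uint32_t rel)
{
	if (hold == rel)
		p->freed = true;
}

/*
 * Replay the diagram's interleaving on one thread. Returns true if the
 * pool was freed, false if it leaked. With reread_flags set, CPU0 checks
 * the flags after its increment instead of using the early snapshot.
 */
static bool race_replay(bool reread_flags)
{
	/* One page is still in flight when destroy starts. */
	struct race_pool p = { .flags = 0, .hold_cnt = 1, .release_cnt = 0 };
	unsigned int cpu0_flags;
	uint32_t cpu0_rel;

	/* CPU0: flags snapshot, shutdown flag not yet set. */
	cpu0_flags = p.flags;

	/* CPU1: the whole of page_pool_destroy() fits in the window. */
	p.hold_cnt += 1;			/* fake in-flight count */
	p.flags |= TOY_FLAG_SHUTDOWN;
	p.release_cnt += 1;			/* assumed: drop the fake count */
	race_free_attempt(&p, p.hold_cnt, p.release_cnt);
	assert(!p.freed);			/* hold 2 != release 1 */

	/* CPU0: returns the last page. */
	cpu0_rel = ++p.release_cnt;
	if (reread_flags)
		cpu0_flags = p.flags;
	if (cpu0_flags & TOY_FLAG_SHUTDOWN)
		race_free_attempt(&p, p.hold_cnt, cpu0_rel);

	return p.freed;
}
```

In this model race_replay(false) reports a leak, while race_replay(true)
frees the pool. Re-reading the flags after the increment is not a fix for
the kernel code by itself, though: as the comment removed by the patch says,
the pool may already be released at that point, so it is not safe to
reference it.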

> 
>> It seems it is very hard to avoid this kind of corner case when using both
>> flags & PP_FLAG_SHUTDOWN and release_cnt/hold_cnt checking to decide if pp
>> instance can be freed.
>> Can we use something like biased reference counting, which is used by frag support
>> in page pool? So that we need to check only one variable and avoid cache
>> bouncing as much as possible.
>>
> 
> See below, I believe we are doing an equivalent refcnt bias trick, that
> solves these corner cases in page_pool_destroy().
> In short: hold_cnt is increased, prior to setting PP_FLAG_SHUTDOWN.
> Thus, if this code READ_ONCE flags without PP_FLAG_SHUTDOWN, we know it
> will not be the last to release pool->pages_state_release_cnt.

That is not exactly the kind of refcnt bias trick I had in mind. I was thinking
about using pool->pages_state_hold_cnt as the refcnt bias and merging it into
pool->pages_state_release_cnt as needed. Maybe I need to try implementing
that to see if it turns out to be what I want.
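
As a rough illustration of that direction, below is a userspace sketch with
C11 atomics. It is one possible reading of the idea and not the page pool
frag code: struct bias_pool, TOY_BIAS and the bias_*() helpers are all
hypothetical. A single atomic counter starts at a large bias, the alloc path
only touches an owner-private hold count, and destroy merges the two by
swapping the bias for the real hold count. Whoever brings the counter to
zero frees the pool, so only one variable decides.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TOY_BIAS 1000000L	/* arbitrary; must exceed any in-flight count */

struct bias_pool {
	atomic_long refs;	/* the single variable every path checks */
	long hold_cnt;		/* owner-private, no atomics on alloc path */
	bool freed;
};

static void bias_init(struct bias_pool *p)
{
	atomic_init(&p->refs, TOY_BIAS);
	p->hold_cnt = 0;
	p->freed = false;
}

/* Alloc path: only bumps the owner-private counter. */
static void bias_hold(struct bias_pool *p)
{
	p->hold_cnt++;
}

/* Release path: one atomic op; whoever reaches zero frees. */
static void bias_release(struct bias_pool *p)
{
	if (atomic_fetch_sub(&p->refs, 1) - 1 == 0)
		p->freed = true;
}

/* Destroy path: merge the bias, i.e. swap TOY_BIAS for the hold count. */
static void bias_destroy(struct bias_pool *p)
{
	long drop = TOY_BIAS - p->hold_cnt;

	if (atomic_fetch_sub(&p->refs, drop) - drop == 0)
		p->freed = true;
}
```

With two pages held, the pool is freed by the second bias_release() if
bias_destroy() ran in between, and by bias_destroy() itself if both pages
were already returned. The sketch says nothing about how hold_cnt would stay
owner-private in the real pool; that part is exactly what I would need to
try out.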

> Below: Perhaps, we should add a RCU grace period to make absolutely
> sure, that this code completes before page_pool_destroy() call completes.
> 
> 
>>> +    if (flags & PP_FLAG_SHUTDOWN)
>>> +        hold_cnt = pp_read_hold_cnt(pool);
>>> +
> 
> I would like to avoid above code, and I'm considering using call_rcu(),
> which I think will resolve the race[0] this code deals with.
> As I explained here[0], this code deals with another kind of race.

Yes, I understand that. I even went to check if the below tracepoint
trace_page_pool_state_release() was causing a use-after-free problem
as it is passing 'pool':)

> 
>  [0] https://lore.kernel.org/all/f671f5da-d9bc-a559-2120-10c3491e6f6d@redhat.com/
> 
>>> +    release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
>>> +    trace_page_pool_state_release(pool, page, release_cnt);
>>> +
>>> +    /* In shutdown phase, last page will free pool instance */
>>> +    if (flags & PP_FLAG_SHUTDOWN)
>>> +        page_pool_free_attempt(pool, hold_cnt, release_cnt);
>>>   }
>>>   EXPORT_SYMBOL(page_pool_release_page);
>>>
>>
>> ...
>>
>>>     void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
>>> @@ -856,6 +884,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
>>>     void page_pool_destroy(struct page_pool *pool)
>>>   {
>>> +    unsigned int flags;
>>> +    u32 release_cnt;
>>> +    u32 hold_cnt;
>>> +
>>>       if (!pool)
>>>           return;
>>>   @@ -868,11 +900,39 @@ void page_pool_destroy(struct page_pool *pool)
>>>       if (!page_pool_release(pool))
>>>           return;
>>>   -    pool->defer_start = jiffies;
>>> -    pool->defer_warn  = jiffies + DEFER_WARN_INTERVAL;
>>> +    /* PP have pages inflight, thus cannot immediately release memory.
>>> +     * Enter into shutdown phase, depending on remaining in-flight PP
>>> +     * pages to trigger shutdown process (on concurrent CPUs) and last
>>> +     * page will free pool instance.
>>> +     *
>>> +     * There exist two race conditions here, we need to take into
>>> +     * account in the following code.
>>> +     *
>>> +     * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +     *    pages into the ptr_ring.  Thus, it missed triggering shutdown
>>> +     *    process, which can then be stalled forever.
>>> +     *
>>> +     * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>>> +     *    page, which triggered shutdown process and freed pool
>>> +     *    instance. Thus, its not safe to dereference *pool afterwards.
>>> +     *
>>> +     * Handling races by holding a fake in-flight count, via
>>> +     * artificially bumping pages_state_hold_cnt, which assures pool
>>> +     * isn't freed under us.  For race(1) its safe to recheck ptr_ring
>>> +     * (it will not free pool). Race(2) cannot happen, and we can
>>> +     * release fake in-flight count as last step.
>>> +     */
>>> +    hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>>> +    smp_store_release(&pool->pages_state_hold_cnt, hold_cnt);
>>
>> I assume the smp_store_release() is used to ensure the correct order
>> between the above store operations?
>> There is data dependency between those two store operations, do we
>> really need the smp_store_release() here?
>>
>>> +    barrier();
>>> +    flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>>
>> Do we need a stronger barrier like smp_rmb() to prevent cpu from
>> executing "flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN"
>> before "smp_store_release(&pool->pages_state_hold_cnt, hold_cnt)"
>> even if there is a smp_store_release() barrier here?
>>
> I do see your point and how it is related to your above comment for
> page_pool_release_page().
> 
> I think we need to replace barrier() with synchronize_rcu().
> Meaning we add a RCU grace period to "wait" for above code (in
> page_pool_release_page) that read the old flags value to complete.
> 
> 
>>> +    smp_store_release(&pool->p.flags, flags);
> 
> When doing a synchronize_rcu(), I assume this smp_store_release() is
> overkill, right?
> Will a WRITE_ONCE() be sufficient?
> 
> Hmm, the synchronize_rcu(), shouldn't that be *after* storing the flags?

Yes.
As I understand it, we probably do not need any of those *_ONCE() and
barrier() calls when using RCU.

But I am not really convinced that we need to go for RCU yet.
