From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Yunsheng Lin <linyunsheng@huawei.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
netdev@vger.kernel.org, Eric Dumazet <eric.dumazet@gmail.com>,
linux-mm@kvack.org, Mel Gorman <mgorman@techsingularity.net>
Cc: brouer@redhat.com, lorenzo@kernel.org,
"Toke Høiland-Jørgensen" <toke@redhat.com>,
bpf@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
"Jakub Kicinski" <kuba@kernel.org>,
"Paolo Abeni" <pabeni@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
willy@infradead.org
Subject: Re: [PATCH RFC net-next/mm V3 1/2] page_pool: Remove workqueue in new shutdown scheme
Date: Thu, 4 May 2023 15:48:10 +0200
Message-ID: <3785321f-b2f8-d753-7efc-78ee40e6d0b6@redhat.com>
In-Reply-To: <387f4653-1986-3ffe-65e7-448a59002ed0@huawei.com>
On 04/05/2023 04.42, Yunsheng Lin wrote:
> On 2023/4/29 0:16, Jesper Dangaard Brouer wrote:
>> void page_pool_release_page(struct page_pool *pool, struct page *page)
>> {
>> + unsigned int flags = READ_ONCE(pool->p.flags);
>> dma_addr_t dma;
>> - int count;
>> + u32 release_cnt;
>> + u32 hold_cnt;
>>
>> if (!(pool->p.flags & PP_FLAG_DMA_MAP))
>> /* Always account for inflight pages, even if we didn't
>> @@ -490,11 +503,15 @@ void page_pool_release_page(struct page_pool *pool, struct page *page)
>> skip_dma_unmap:
>> page_pool_clear_pp_info(page);
>>
>> - /* This may be the last page returned, releasing the pool, so
>> - * it is not safe to reference pool afterwards.
>> - */
>> - count = atomic_inc_return_relaxed(&pool->pages_state_release_cnt);
>> - trace_page_pool_state_release(pool, page, count);
>
> There is a time window between "unsigned int flags = READ_ONCE(pool->p.flags)"
> and flags checking, if page_pool_destroy() is called concurrently during that
> time window, it seems we will have a pp instance leaking problem here?
>
Nope, that is resolved by the code changes in page_pool_destroy(), see
below.
> It seems it is very hard to avoid this kind of corner case when using both
> flags & PP_FLAG_SHUTDOWN and release_cnt/hold_cnt checking to decide if pp
> instance can be freed.
> Can we use something like biased reference counting, which is used by frag support
> in page pool? So that we need to check only one variable and avoid cache
> bouncing as much as possible.
>
See below. I believe we are already doing an equivalent refcnt bias trick
in page_pool_destroy(), and that it solves these corner cases.

In short: hold_cnt is increased prior to setting PP_FLAG_SHUTDOWN.
Thus, if the READ_ONCE() in this code returns flags without
PP_FLAG_SHUTDOWN, we know it will not be the last to increment
pool->pages_state_release_cnt.

Perhaps we should also add an RCU grace period (see below) to make
absolutely sure that this code completes before the page_pool_destroy()
call completes.
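
To illustrate the argument above, here is a minimal single-threaded userspace
sketch of the bias idea. It is a model, not the patch itself: the names
model_pool, model_free_attempt(), model_release_page(), model_destroy_begin()
and model_destroy_finish() are hypothetical, C11 atomics stand in for the
kernel primitives, and the free condition (hold_cnt equal to release_cnt) is
an assumption about what page_pool_free_attempt() checks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_FLAG_SHUTDOWN 0x1u /* stands in for PP_FLAG_SHUTDOWN */

struct model_pool {
	_Atomic uint32_t hold_cnt;    /* pages handed out, plus the bias */
	_Atomic uint32_t release_cnt; /* pages returned */
	_Atomic unsigned int flags;
	bool freed;                   /* set where the kernel would free the pool */
};

static void model_init(struct model_pool *p, uint32_t inflight)
{
	atomic_init(&p->hold_cnt, inflight);
	atomic_init(&p->release_cnt, 0);
	atomic_init(&p->flags, 0);
	p->freed = false;
}

/* Assumption: the pool is freed only when nothing is in flight. */
static bool model_free_attempt(struct model_pool *p, uint32_t hold, uint32_t rel)
{
	if ((int32_t)(hold - rel) != 0)
		return false;
	p->freed = true;
	return true;
}

/* Release path; 'flags' is the snapshot taken at function entry. */
static void model_release_page(struct model_pool *p, unsigned int flags)
{
	uint32_t hold = 0;
	uint32_t rel;

	if (flags & MODEL_FLAG_SHUTDOWN)
		hold = atomic_load(&p->hold_cnt);
	rel = atomic_fetch_add(&p->release_cnt, 1) + 1;
	if (flags & MODEL_FLAG_SHUTDOWN)
		model_free_attempt(p, hold, rel);
}

/* Destroy, step 1: take the fake in-flight count, then publish shutdown. */
static uint32_t model_destroy_begin(struct model_pool *p)
{
	uint32_t hold = atomic_fetch_add(&p->hold_cnt, 1) + 1;

	atomic_fetch_or(&p->flags, MODEL_FLAG_SHUTDOWN);
	return hold;
}

/* Destroy, step 2: drop the fake in-flight count; this may free the pool. */
static bool model_destroy_finish(struct model_pool *p, uint32_t hold)
{
	uint32_t rel;

	assert(atomic_load(&p->flags) & MODEL_FLAG_SHUTDOWN);
	rel = atomic_fetch_add(&p->release_cnt, 1) + 1;
	return model_free_attempt(p, hold, rel);
}

/* Releaser sampled flags before shutdown was published. */
static bool scenario_stale_flags(void)
{
	struct model_pool p;

	model_init(&p, 1);
	unsigned int stale = atomic_load(&p.flags);
	uint32_t bias_hold = model_destroy_begin(&p);

	model_release_page(&p, stale);  /* skips the free attempt */
	bool freed_early = p.freed;
	bool freed_by_destroy = model_destroy_finish(&p, bias_hold);

	return !freed_early && freed_by_destroy;
}

/* Last page comes back after destroy has dropped its bias. */
static bool scenario_late_release(void)
{
	struct model_pool p;

	model_init(&p, 1);
	uint32_t bias_hold = model_destroy_begin(&p);
	bool freed_by_destroy = model_destroy_finish(&p, bias_hold);

	model_release_page(&p, atomic_load(&p.flags));
	return !freed_by_destroy && p.freed;
}
```

In the first scenario the releaser holding a stale flags snapshot never
attempts the free, yet the pool is not leaked, because the bias guarantees
that the destroy path performs the final release. The sketch says nothing
about memory ordering between CPUs, which is what the barrier and RCU
discussion further down is about.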
>> + if (flags & PP_FLAG_SHUTDOWN)
>> + hold_cnt = pp_read_hold_cnt(pool);
>> +
I would like to avoid the above code, and I'm considering using call_rcu(),
which I think will resolve the race this code deals with.
As I explained here [0], this code deals with another kind of race.

[0] https://lore.kernel.org/all/f671f5da-d9bc-a559-2120-10c3491e6f6d@redhat.com/
>> + release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
>> + trace_page_pool_state_release(pool, page, release_cnt);
>> +
>> + /* In shutdown phase, last page will free pool instance */
>> + if (flags & PP_FLAG_SHUTDOWN)
>> + page_pool_free_attempt(pool, hold_cnt, release_cnt);
>> }
>> EXPORT_SYMBOL(page_pool_release_page);
>>
>
> ...
>
>>
>> void page_pool_use_xdp_mem(struct page_pool *pool, void (*disconnect)(void *),
>> @@ -856,6 +884,10 @@ EXPORT_SYMBOL(page_pool_unlink_napi);
>>
>> void page_pool_destroy(struct page_pool *pool)
>> {
>> + unsigned int flags;
>> + u32 release_cnt;
>> + u32 hold_cnt;
>> +
>> if (!pool)
>> return;
>>
>> @@ -868,11 +900,39 @@ void page_pool_destroy(struct page_pool *pool)
>> if (!page_pool_release(pool))
>> return;
>>
>> - pool->defer_start = jiffies;
>> - pool->defer_warn = jiffies + DEFER_WARN_INTERVAL;
>> + /* PP have pages inflight, thus cannot immediately release memory.
>> + * Enter into shutdown phase, depending on remaining in-flight PP
>> + * pages to trigger shutdown process (on concurrent CPUs) and last
>> + * page will free pool instance.
>> + *
>> + * There exist two race conditions here, we need to take into
>> + * account in the following code.
>> + *
>> + * 1. Before setting PP_FLAG_SHUTDOWN another CPU released the last
>> + * pages into the ptr_ring. Thus, it missed triggering shutdown
>> + * process, which can then be stalled forever.
>> + *
>> + * 2. After setting PP_FLAG_SHUTDOWN another CPU released the last
>> + * page, which triggered shutdown process and freed pool
>> + * instance. Thus, its not safe to dereference *pool afterwards.
>> + *
>> + * Handling races by holding a fake in-flight count, via
>> + * artificially bumping pages_state_hold_cnt, which assures pool
>> + * isn't freed under us. For race(1) its safe to recheck ptr_ring
>> + * (it will not free pool). Race(2) cannot happen, and we can
>> + * release fake in-flight count as last step.
>> + */
>> + hold_cnt = READ_ONCE(pool->pages_state_hold_cnt) + 1;
>> + smp_store_release(&pool->pages_state_hold_cnt, hold_cnt);
>
> I assume the smp_store_release() is used to ensure the correct order
> between the above store operations?
> There is data dependency between those two store operations, do we
> really need the smp_store_release() here?
>
>> + barrier();
>> + flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN;
>
> Do we need a stronger barrier like smp_rmb() to prevent cpu from
> executing "flags = READ_ONCE(pool->p.flags) | PP_FLAG_SHUTDOWN"
> before "smp_store_release(&pool->pages_state_hold_cnt, hold_cnt)"
> even if there is a smp_store_release() barrier here?
>
I do see your point, and how it relates to your comment above on
page_pool_release_page().

I think we need to replace barrier() with synchronize_rcu().
That is, we add an RCU grace period to "wait" for the above code (in
page_pool_release_page()) that read the old flags value to complete.
>> + smp_store_release(&pool->p.flags, flags);
With a synchronize_rcu() in place, I assume this smp_store_release() is
overkill, right? Would a WRITE_ONCE() be sufficient?

Hmm, shouldn't the synchronize_rcu() come *after* storing the flags?
>> +
>> + /* Concurrent CPUs could have returned last pages into ptr_ring */
>> + page_pool_empty_ring(pool);
>>
>> - INIT_DELAYED_WORK(&pool->release_dw, page_pool_release_retry);
>> - schedule_delayed_work(&pool->release_dw, DEFER_TIME);
>> + release_cnt = atomic_inc_return(&pool->pages_state_release_cnt);
>> + page_pool_free_attempt(pool, hold_cnt, release_cnt);
>> }
>> EXPORT_SYMBOL(page_pool_destroy);