* [PATCH RFC v4 0/3] fix two bugs related to page_pool
@ 2024-11-20 10:34 Yunsheng Lin
  2024-11-20 10:34 ` [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages Yunsheng Lin

From: Yunsheng Lin @ 2024-11-20 10:34 UTC
To: davem, kuba, pabeni
Cc: liuyonglong, fanghaiqing, zhangkun09, Yunsheng Lin, Alexander Lobakin,
    Robin Murphy, Alexander Duyck, Andrew Morton, IOMMU, MM

Patch 1 fixes a possible time-window problem for page_pool.
Patch 2 fixes the kernel crash in iommu_get_dma_domain() reported in [1]
by scanning for the inflight pages and unmapping them.
Patch 3 avoids calling the dma sync API after the driver has already
unbound.

From the performance data below, there seems to be no noticeable
performance impact.

Before this patchset:
root@(none)$ taskset -c 0 insmod bench_page_pool_simple.ko
[ 165.357058] bench_page_pool_simple: Loaded
[ 165.438159] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076973110 sec time_interval:76973110) - (invoke count:100000000 tsc_interval:7697296)
[ 167.423811] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 19.683 ns (step:0) - (measurement period time:1.968369340 sec time_interval:1968369340) - (invoke count:100000000 tsc_interval:196836926)
[ 167.591773] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.006 ns (step:0) - (measurement period time:0.150069980 sec time_interval:150069980) - (invoke count:10000000 tsc_interval:15006991)
[ 168.265447] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.565 ns (step:0) - (measurement period time:0.656564890 sec time_interval:656564890) - (invoke count:100000000 tsc_interval:65656477)
[ 168.282469] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[ 168.572734] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 28.097 ns (step:0) - (measurement period time:0.280971960 sec time_interval:280971960) - (invoke count:10000000 tsc_interval:28097187)
[ 168.591404] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[ 169.178662] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 57.805 ns (step:0) - (measurement period time:0.578052550 sec time_interval:578052550) - (invoke count:10000000 tsc_interval:57805246)
[ 169.197331] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[ 171.033303] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 182.711 ns (step:0) - (measurement period time:1.827113580 sec time_interval:1827113580) - (invoke count:10000000 tsc_interval:182711348)
[ 171.052324] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[ 171.060227] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[ 171.350242] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 28.089 ns (step:0) - (measurement period time:0.280896430 sec time_interval:280896430) - (invoke count:10000000 tsc_interval:28089636)
[ 171.369517] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[ 171.903169] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 52.461 ns (step:0) - (measurement period time:0.524619700 sec time_interval:524619700) - (invoke count:10000000 tsc_interval:52461966)
[ 171.922357] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[ 173.851219] time_bench: Type:tasklet_page_pool03_slow Per elem: 19 cycles(tsc) 192.017 ns (step:0) - (measurement period time:1.920178560 sec time_interval:1920178560) - (invoke count:10000000 tsc_interval:192017848)

After this patchset:
root@(none)$ taskset -c 0 insmod bench_page_pool_simple.ko
[ 394.337302] bench_page_pool_simple: Loaded
[ 394.418402] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.769 ns (step:0) - (measurement period time:0.076976830 sec time_interval:76976830) - (invoke count:100000000 tsc_interval:7697673)
[ 396.168990] time_bench: Type:atomic_inc Per elem: 1 cycles(tsc) 17.333 ns (step:0) - (measurement period time:1.733304770 sec time_interval:1733304770) - (invoke count:100000000 tsc_interval:173330470)
[ 396.336932] time_bench: Type:lock Per elem: 1 cycles(tsc) 15.005 ns (step:0) - (measurement period time:0.150052930 sec time_interval:150052930) - (invoke count:10000000 tsc_interval:15005288)
[ 397.008173] time_bench: Type:rcu Per elem: 0 cycles(tsc) 6.541 ns (step:0) - (measurement period time:0.654135460 sec time_interval:654135460) - (invoke count:100000000 tsc_interval:65413540)
[ 397.025193] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path
[ 397.295761] time_bench: Type:no-softirq-page_pool01 Per elem: 2 cycles(tsc) 26.127 ns (step:0) - (measurement period time:0.261275610 sec time_interval:261275610) - (invoke count:10000000 tsc_interval:26127555)
[ 397.314429] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path
[ 397.852216] time_bench: Type:no-softirq-page_pool02 Per elem: 5 cycles(tsc) 52.858 ns (step:0) - (measurement period time:0.528581530 sec time_interval:528581530) - (invoke count:10000000 tsc_interval:52858146)
[ 397.870887] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path
[ 399.701260] time_bench: Type:no-softirq-page_pool03 Per elem: 18 cycles(tsc) 182.151 ns (step:0) - (measurement period time:1.821514450 sec time_interval:1821514450) - (invoke count:10000000 tsc_interval:182151437)
[ 399.720282] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path
[ 399.728186] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path
[ 399.998947] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 2 cycles(tsc) 26.164 ns (step:0) - (measurement period time:0.261642940 sec time_interval:261642940) - (invoke count:10000000 tsc_interval:26164289)
[ 400.018223] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path
[ 400.621035] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 5 cycles(tsc) 59.377 ns (step:0) - (measurement period time:0.593779950 sec time_interval:593779950) - (invoke count:10000000 tsc_interval:59377988)
[ 400.640223] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path
[ 402.524760] time_bench: Type:tasklet_page_pool03_slow Per elem: 18 cycles(tsc) 187.585 ns (step:0) - (measurement period time:1.875853550 sec time_interval:1875853550) - (invoke count:10000000 tsc_interval:187585349)

1. https://lore.kernel.org/lkml/8067f204-1380-4d37-8ffd-007fc6f26738@kernel.org/T/

CC: Alexander Lobakin <aleksander.lobakin@intel.com>
CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: IOMMU <iommu@lists.linux.dev>
CC: MM <linux-mm@kvack.org>

Change log:
V4:
  1. Use scanning to do the unmapping.
  2. Split the dma sync skipping into a separate patch.

V3:
  1. Target the net-next tree instead of the net tree.
  2. Narrow the rcu lock as per the discussion in v2.
  3. Check the unmapping cnt against the inflight cnt.

V2:
  1. Add an item_full stat.
  2. Use container_of() for page_pool_to_pp().

Yunsheng Lin (3):
  page_pool: fix timing for checking and disabling napi_local
  page_pool: fix IOMMU crash when driver has already unbound
  page_pool: skip dma sync operation for inflight pages

 include/net/page_pool/types.h |   6 +-
 net/core/page_pool.c          | 135 ++++++++++++++++++++++++++++------
 2 files changed, 119 insertions(+), 22 deletions(-)

--
2.33.0

* [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-20 10:34 [PATCH RFC v4 0/3] fix two bugs related to page_pool Yunsheng Lin
@ 2024-11-20 10:34 ` Yunsheng Lin
  2024-11-20 16:17   ` Robin Murphy

From: Yunsheng Lin @ 2024-11-20 10:34 UTC
To: davem, kuba, pabeni
Cc: liuyonglong, fanghaiqing, zhangkun09, Yunsheng Lin, Robin Murphy,
    Alexander Duyck, Andrew Morton, IOMMU, MM, Jesper Dangaard Brouer,
    Ilias Apalodimas, Eric Dumazet, Simon Horman, netdev, linux-kernel

Skip the dma sync operation for inflight pages before
page_pool_destroy() returns to the driver, as the DMA API
expects to be called with a valid device bound to a driver,
as mentioned in [1].

After page_pool_destroy() is called, the page is not
expected to be recycled back to the pool->alloc cache, and
the dma sync operation is not needed when the page is not
recyclable or pool->ring is full. So only skip the dma sync
operation for the inflight pages by clearing pool->dma_sync
under the protection of the rcu lock when a page is recycled
to pool->ring, to ensure that no dma sync operation is called
after page_pool_destroy() returns.

1. https://lore.kernel.org/all/caf31b5e-0e8f-4844-b7ba-ef59ed13b74e@arm.com/

CC: Robin Murphy <robin.murphy@arm.com>
CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: IOMMU <iommu@lists.linux.dev>
CC: MM <linux-mm@kvack.org>
Fixes: f71fec47c2df ("page_pool: make sure struct device is stable")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 net/core/page_pool.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 33a314abbba4..0bde7c6c781a 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -712,7 +712,8 @@ static void page_pool_return_page(struct page_pool *pool, netmem_ref netmem)
 	rcu_read_unlock();
 }
 
-static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
+static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem,
+				      unsigned int dma_sync_size)
 {
 	int ret;
 	/* BH protection not needed if current is softirq */
@@ -723,10 +724,13 @@ static bool page_pool_recycle_in_ring(struct page_pool *pool, netmem_ref netmem)
 
 	if (!ret) {
 		recycle_stat_inc(pool, ring);
-		return true;
+
+		rcu_read_lock();
+		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
+		rcu_read_unlock();
 	}
 
-	return false;
+	return !ret;
 }
 
 /* Only allow direct recycling in special circumstances, into the
@@ -779,10 +783,11 @@ __page_pool_put_page(struct page_pool *pool, netmem_ref netmem,
 	if (likely(__page_pool_page_can_be_recycled(netmem))) {
 		/* Read barrier done in page_ref_count / READ_ONCE */
 
-		page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
-
-		if (allow_direct && page_pool_recycle_in_cache(netmem, pool))
+		if (allow_direct && page_pool_recycle_in_cache(netmem, pool)) {
+			page_pool_dma_sync_for_device(pool, netmem,
+						      dma_sync_size);
 			return 0;
+		}
 
 		/* Page found as candidate for recycling */
 		return netmem;
@@ -845,7 +850,7 @@ void page_pool_put_unrefed_netmem(struct page_pool *pool, netmem_ref netmem,
 
 	netmem =
 		__page_pool_put_page(pool, netmem, dma_sync_size, allow_direct);
-	if (netmem && !page_pool_recycle_in_ring(pool, netmem)) {
+	if (netmem && !page_pool_recycle_in_ring(pool, netmem, dma_sync_size)) {
 		/* Cache full, fallback to free pages */
 		recycle_stat_inc(pool, ring_full);
 		page_pool_return_page(pool, netmem);
@@ -903,14 +908,18 @@ void page_pool_put_page_bulk(struct page_pool *pool, void **data,
 
 	/* Bulk producer into ptr_ring page_pool cache */
 	in_softirq = page_pool_producer_lock(pool);
+	rcu_read_lock();
 	for (i = 0; i < bulk_len; i++) {
 		if (__ptr_ring_produce(&pool->ring, data[i])) {
 			/* ring full */
 			recycle_stat_inc(pool, ring_full);
 			break;
 		}
+		page_pool_dma_sync_for_device(pool, (__force netmem_ref)data[i],
+					      -1);
 	}
 	recycle_stat_add(pool, ring, i);
+	rcu_read_unlock();
 	page_pool_producer_unlock(pool, in_softirq);
 
 	/* Hopefully all pages was return into ptr_ring */
@@ -1200,6 +1209,8 @@ void page_pool_destroy(struct page_pool *pool)
 	if (!page_pool_release(pool))
 		return;
 
+	pool->dma_sync = false;
+
 	/* Paired with rcu lock in page_pool_napi_local() to enable clearing
 	 * of pool->p.napi in page_pool_disable_direct_recycling() is seen
 	 * before returning to driver to free the napi instance.

--
2.33.0

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-20 10:34 ` [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages Yunsheng Lin
@ 2024-11-20 16:17   ` Robin Murphy
  2024-11-21  8:04     ` Yunsheng Lin

From: Robin Murphy @ 2024-11-20 16:17 UTC
To: Yunsheng Lin, davem, kuba, pabeni
Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck, Andrew Morton,
    IOMMU, MM, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
    Simon Horman, netdev, linux-kernel

On 20/11/2024 10:34 am, Yunsheng Lin wrote:
> Skip the dma sync operation for inflight pages before
> page_pool_destroy() returns to the driver, as the DMA API
> expects to be called with a valid device bound to a driver,
> as mentioned in [1].
>
> After page_pool_destroy() is called, the page is not
> expected to be recycled back to the pool->alloc cache, and
> the dma sync operation is not needed when the page is not
> recyclable or pool->ring is full. So only skip the dma sync
> operation for the inflight pages by clearing pool->dma_sync
> under the protection of the rcu lock when a page is recycled
> to pool->ring, to ensure that no dma sync operation is called
> after page_pool_destroy() returns.

Something feels off here - either this is a micro-optimisation which I
wouldn't really expect to be meaningful, or it means patch #2 doesn't
actually do what it claims. If it really is possible to attempt to
dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed
and unmapped it, that represents yet another DMA API lifecycle issue,
which as well as being even more obviously incorrect usage-wise, could
also still lead to the same crash (if the device is non-coherent).

Otherwise, I don't imagine it's really worth worrying about optimising
out syncs for any pages which happen to get naturally returned after
page_pool_destroy() starts but before they're explicitly reclaimed.
Realistically, the kinds of big server systems where reclaim takes an
appreciable amount of time are going to be coherent and skipping syncs
anyway.

Thanks,
Robin.

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-20 16:17   ` Robin Murphy
@ 2024-11-21  8:04     ` Yunsheng Lin
  2024-11-21 13:44       ` Robin Murphy

From: Yunsheng Lin @ 2024-11-21 8:04 UTC
To: Robin Murphy, davem, kuba, pabeni
Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck, Andrew Morton,
    IOMMU, MM, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
    Simon Horman, netdev, linux-kernel

On 2024/11/21 0:17, Robin Murphy wrote:
> On 20/11/2024 10:34 am, Yunsheng Lin wrote:
>> Skip the dma sync operation for inflight pages before
>> page_pool_destroy() returns to the driver, as the DMA API
>> expects to be called with a valid device bound to a driver,
>> as mentioned in [1].
>>
>> After page_pool_destroy() is called, the page is not
>> expected to be recycled back to the pool->alloc cache, and
>> the dma sync operation is not needed when the page is not
>> recyclable or pool->ring is full. So only skip the dma sync
>> operation for the inflight pages by clearing pool->dma_sync
>> under the protection of the rcu lock when a page is recycled
>> to pool->ring, to ensure that no dma sync operation is called
>> after page_pool_destroy() returns.
>
> Something feels off here - either this is a micro-optimisation which I
> wouldn't really expect to be meaningful, or it means patch #2 doesn't
> actually do what it claims. If it really is possible to attempt to
> dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed
> and unmapped it, that represents yet another DMA API lifecycle issue,
> which as well as being even more obviously incorrect usage-wise, could
> also still lead to the same crash (if the device is non-coherent).

For a page_pool owned page, it mostly goes through the steps below:

1. page_pool calls the buddy allocator API to allocate a page, and calls
   the DMA mapping and sync_for_device APIs for it if its pool is empty,
   or it reuses a page already in the pool.

2. The driver calls the page_pool API to allocate the page, and passes
   the page to the network stack after the packet is dma'ed into the
   page and the sync_for_cpu API is called.

3. The network stack is done with the page and calls the page_pool API
   to free the page.

4. page_pool does the dma unmapping and releases the page back to the
   buddy allocator if the page is not recyclable, or it does the
   sync_for_device and puts the page in its pool; the page might go
   through step 1 again if the driver calls the page_pool allocate API.

The calling of the dma mapping and dma sync APIs is controlled by
pool->dma_map and pool->dma_sync respectively; the previous patch only
clears pool->dma_map after doing the dma unmapping. This patch ensures
that there is no dma_sync for the recycle case of step 4 by clearing
pool->dma_sync.

The dma_sync skipping should also happen before page_pool_inflight_unmap()
is called, because all the callers will see the clearing of pool->dma_sync
after synchronize_rcu(), and page_pool_inflight_unmap() is called after
the same synchronize_rcu() in page_pool_destroy().

> Otherwise, I don't imagine it's really worth worrying about optimising
> out syncs for any pages which happen to get naturally returned after
> page_pool_destroy() starts but before they're explicitly reclaimed.
> Realistically, the kinds of big server systems where reclaim takes an
> appreciable amount of time are going to be coherent and skipping syncs
> anyway.

The skipping is about skipping the dma sync for those inflight pages; I
should make it clearer in the commit log that the skipping happens before
the calling of page_pool_inflight_unmap() rather than page_pool_destroy().

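A minimal, self-contained userspace sketch of the lifecycle described
above may help here; the names (toy_pool, toy_put_page) and the printf
stand-ins for the DMA calls are hypothetical and are not the
net/core/page_pool.c implementation. It only shows how the pool->dma_map
and pool->dma_sync flags gate the DMA calls in step 4:

#include <stdbool.h>
#include <stdio.h>

#define TOY_RING_SIZE 8

struct toy_pool {
	bool dma_map;               /* pool was created with DMA mapping     */
	bool dma_sync;              /* pool is still allowed to do dma syncs */
	int ring[TOY_RING_SIZE];    /* simplified recycle ring of page ids   */
	int ring_cnt;
};

/* stand-ins for the real DMA API calls */
static void toy_dma_sync_for_device(int page) { printf("sync_for_device(page %d)\n", page); }
static void toy_dma_unmap(int page)           { printf("unmap(page %d)\n", page); }

/* Step 4: either recycle into the ring (dma sync first, gated by dma_sync)
 * or hand the page back to the allocator (dma unmap first, gated by dma_map).
 */
static void toy_put_page(struct toy_pool *pool, int page, bool recyclable)
{
	if (recyclable && pool->ring_cnt < TOY_RING_SIZE) {
		if (pool->dma_sync)
			toy_dma_sync_for_device(page);
		pool->ring[pool->ring_cnt++] = page;  /* may be reused in step 1 */
		return;
	}

	if (pool->dma_map)
		toy_dma_unmap(page);
	printf("page %d released to the page allocator\n", page);
}

int main(void)
{
	struct toy_pool pool = { .dma_map = true, .dma_sync = true };

	toy_put_page(&pool, 1, true);   /* recycled: sync happens              */
	pool.dma_sync = false;          /* as if page_pool_destroy() was called */
	toy_put_page(&pool, 2, true);   /* recycled: sync is now skipped       */
	toy_put_page(&pool, 3, false);  /* not recyclable: unmapped and freed  */
	return 0;
}
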
* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-21  8:04     ` Yunsheng Lin
@ 2024-11-21 13:44       ` Robin Murphy
  2024-11-22  7:20         ` Yunsheng Lin

From: Robin Murphy @ 2024-11-21 13:44 UTC
To: Yunsheng Lin, davem, kuba, pabeni
Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck, Andrew Morton,
    IOMMU, MM, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
    Simon Horman, netdev, linux-kernel

On 21/11/2024 8:04 am, Yunsheng Lin wrote:
> On 2024/11/21 0:17, Robin Murphy wrote:
>> On 20/11/2024 10:34 am, Yunsheng Lin wrote:
>>> Skip the dma sync operation for inflight pages before
>>> page_pool_destroy() returns to the driver, as the DMA API
>>> expects to be called with a valid device bound to a driver,
>>> as mentioned in [1].
>>>
>>> After page_pool_destroy() is called, the page is not
>>> expected to be recycled back to the pool->alloc cache, and
>>> the dma sync operation is not needed when the page is not
>>> recyclable or pool->ring is full. So only skip the dma sync
>>> operation for the inflight pages by clearing pool->dma_sync
>>> under the protection of the rcu lock when a page is recycled
>>> to pool->ring, to ensure that no dma sync operation is called
>>> after page_pool_destroy() returns.
>>
>> Something feels off here - either this is a micro-optimisation which I
>> wouldn't really expect to be meaningful, or it means patch #2 doesn't
>> actually do what it claims. If it really is possible to attempt to
>> dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed
>> and unmapped it, that represents yet another DMA API lifecycle issue,
>> which as well as being even more obviously incorrect usage-wise, could
>> also still lead to the same crash (if the device is non-coherent).
>
> For a page_pool owned page, it mostly goes through the steps below:
>
> 1. page_pool calls the buddy allocator API to allocate a page, and calls
>    the DMA mapping and sync_for_device APIs for it if its pool is empty,
>    or it reuses a page already in the pool.
>
> 2. The driver calls the page_pool API to allocate the page, and passes
>    the page to the network stack after the packet is dma'ed into the
>    page and the sync_for_cpu API is called.
>
> 3. The network stack is done with the page and calls the page_pool API
>    to free the page.
>
> 4. page_pool does the dma unmapping and releases the page back to the
>    buddy allocator if the page is not recyclable, or it does the
>    sync_for_device and puts the page in its pool; the page might go
>    through step 1 again if the driver calls the page_pool allocate API.
>
> The calling of the dma mapping and dma sync APIs is controlled by
> pool->dma_map and pool->dma_sync respectively; the previous patch only
> clears pool->dma_map after doing the dma unmapping. This patch ensures
> that there is no dma_sync for the recycle case of step 4 by clearing
> pool->dma_sync.

But *why* does it want to ensure that? Is there some possible race where
one thread can attempt to sync and recycle a page while another thread
is attempting to unmap and free it, such that you can't guarantee the
correctness of dma_sync calls after page_pool_inflight_unmap() has
started, and skipping them is a workaround for that? If so, then frankly
I think that would want solving properly, but at the very least this
change would need to come before patch #2.

If not, and this is just some attempt at performance micro-optimisation,
then I'd be keen to see the numbers to justify it, since I struggle to
imagine it being worth the bother while already in the process of
spending whole seconds scanning memory...

Thanks,
Robin.

> The dma_sync skipping should also happen before page_pool_inflight_unmap()
> is called, because all the callers will see the clearing of pool->dma_sync
> after synchronize_rcu(), and page_pool_inflight_unmap() is called after
> the same synchronize_rcu() in page_pool_destroy().
>
>> Otherwise, I don't imagine it's really worth worrying about optimising
>> out syncs for any pages which happen to get naturally returned after
>> page_pool_destroy() starts but before they're explicitly reclaimed.
>> Realistically, the kinds of big server systems where reclaim takes an
>> appreciable amount of time are going to be coherent and skipping syncs
>> anyway.
>
> The skipping is about skipping the dma sync for those inflight pages; I
> should make it clearer in the commit log that the skipping happens before
> the calling of page_pool_inflight_unmap() rather than page_pool_destroy().

* Re: [PATCH RFC v4 3/3] page_pool: skip dma sync operation for inflight pages
  2024-11-21 13:44       ` Robin Murphy
@ 2024-11-22  7:20         ` Yunsheng Lin

From: Yunsheng Lin @ 2024-11-22 7:20 UTC
To: Robin Murphy, davem, kuba, pabeni
Cc: liuyonglong, fanghaiqing, zhangkun09, Alexander Duyck, Andrew Morton,
    IOMMU, MM, Jesper Dangaard Brouer, Ilias Apalodimas, Eric Dumazet,
    Simon Horman, netdev, linux-kernel

On 2024/11/21 21:44, Robin Murphy wrote:
> On 21/11/2024 8:04 am, Yunsheng Lin wrote:
>> On 2024/11/21 0:17, Robin Murphy wrote:
>>> On 20/11/2024 10:34 am, Yunsheng Lin wrote:
>>>> Skip the dma sync operation for inflight pages before
>>>> page_pool_destroy() returns to the driver, as the DMA API
>>>> expects to be called with a valid device bound to a driver,
>>>> as mentioned in [1].
>>>>
>>>> After page_pool_destroy() is called, the page is not
>>>> expected to be recycled back to the pool->alloc cache, and
>>>> the dma sync operation is not needed when the page is not
>>>> recyclable or pool->ring is full. So only skip the dma sync
>>>> operation for the inflight pages by clearing pool->dma_sync
>>>> under the protection of the rcu lock when a page is recycled
>>>> to pool->ring, to ensure that no dma sync operation is called
>>>> after page_pool_destroy() returns.
>>>
>>> Something feels off here - either this is a micro-optimisation which I
>>> wouldn't really expect to be meaningful, or it means patch #2 doesn't
>>> actually do what it claims. If it really is possible to attempt to
>>> dma_sync a page *after* page_pool_inflight_unmap() has already reclaimed
>>> and unmapped it, that represents yet another DMA API lifecycle issue,
>>> which as well as being even more obviously incorrect usage-wise, could
>>> also still lead to the same crash (if the device is non-coherent).
>>
>> For a page_pool owned page, it mostly goes through the steps below:
>>
>> 1. page_pool calls the buddy allocator API to allocate a page, and calls
>>    the DMA mapping and sync_for_device APIs for it if its pool is empty,
>>    or it reuses a page already in the pool.
>>
>> 2. The driver calls the page_pool API to allocate the page, and passes
>>    the page to the network stack after the packet is dma'ed into the
>>    page and the sync_for_cpu API is called.
>>
>> 3. The network stack is done with the page and calls the page_pool API
>>    to free the page.
>>
>> 4. page_pool does the dma unmapping and releases the page back to the
>>    buddy allocator if the page is not recyclable, or it does the
>>    sync_for_device and puts the page in its pool; the page might go
>>    through step 1 again if the driver calls the page_pool allocate API.
>>
>> The calling of the dma mapping and dma sync APIs is controlled by
>> pool->dma_map and pool->dma_sync respectively; the previous patch only
>> clears pool->dma_map after doing the dma unmapping. This patch ensures
>> that there is no dma_sync for the recycle case of step 4 by clearing
>> pool->dma_sync.
>
> But *why* does it want to ensure that? Is there some possible race where
> one thread can attempt to sync and recycle a page while another thread
> is attempting to unmap and free it, such that you can't guarantee the
> correctness of dma_sync calls after page_pool_inflight_unmap() has
> started, and skipping them is a workaround for that? If so, then frankly
> I think that would want solving properly, but at the very least this
> change would need to come before patch #2.

The racing window is something like below.

page_pool_destroy() and page_pool_put_page() can be called concurrently.
Patch 2 only uses a spinlock to synchronise page_pool_inflight_unmap()
with page_pool_return_page() called by page_pool_put_page() to avoid
concurrent dma unmapping; there is no synchronization between
page_pool_destroy() and page_pool_dma_sync_for_device() called by
page_pool_put_page():

           CPU0                                 CPU1
             .                                    .
     page_pool_destroy()                 page_pool_put_page()
             .                                    .
      synchronize_rcu()                           .
             .                                    .
 page_pool_inflight_unmap()                       .
             .                                    .
             .                         __page_pool_put_page()
             .                                    .
             .                    page_pool_dma_sync_for_device()
             .                                    .

After this patch, page_pool_dma_sync_for_device() is protected by the
rcu lock, pool->dma_sync is cleared before synchronize_rcu(), and
page_pool_inflight_unmap() is called after synchronize_rcu(), to ensure
that page_pool_dma_sync_for_device() will not call the dma sync API
after synchronize_rcu():

           CPU0                                 CPU1
             .                                    .
     page_pool_destroy()                 page_pool_put_page()
             .                                    .
   pool->dma_sync = false                         .
             .                                    .
      synchronize_rcu()                           .
             .                                    .
 page_pool_inflight_unmap()                       .
             .                                    .
             .                      page_pool_recycle_in_ring()
             .                                    .
             .                            rcu_read_lock()
             .                    page_pool_dma_sync_for_device()
             .                           rcu_read_unlock()
             .                                    .

Previously patches 2 & 3 were combined as one patch; this version splits
them out to make the series more reviewable. I am not sure it matters
that much about the patch order, as the fix doesn't seem to be complete
unless both patches are included.

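The ordering shown in the second diagram can also be sketched in
userspace. The snippet below is only an illustration under the assumption
that the liburcu library is available (built with "gcc rcu_order.c -lurcu");
the function names are hypothetical and this is not the kernel page_pool
code. The point it demonstrates: the writer clears the flag and then waits
in synchronize_rcu(), so a reader inside rcu_read_lock()/rcu_read_unlock()
either still sees the flag set and finishes its sync before the unmap
starts, or sees it cleared and skips the sync entirely.

#include <stdbool.h>
#include <stdio.h>
#include <urcu.h>

static bool dma_sync_enabled = true;

/* Recycle path (CPU1 in the diagram): only issue the (simulated) dma sync
 * inside an RCU read-side critical section, and only while the flag is set.
 */
static void recycle_page(int page)
{
	rcu_read_lock();
	if (dma_sync_enabled)
		printf("dma_sync_for_device(page %d)\n", page);
	else
		printf("sync skipped for page %d\n", page);
	rcu_read_unlock();
}

/* Destroy path (CPU0 in the diagram): clear the flag first, then wait for
 * every reader that might still see the old value. After synchronize_rcu()
 * returns, no recycler can start a new sync, so the inflight pages can be
 * unmapped.
 */
static void destroy_pool(void)
{
	dma_sync_enabled = false;
	synchronize_rcu();
	printf("unmap inflight pages\n");
}

int main(void)
{
	rcu_register_thread();
	recycle_page(1);        /* before destroy: sync is issued */
	destroy_pool();
	recycle_page(2);        /* after destroy: sync is skipped */
	rcu_unregister_thread();
	return 0;
}
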