On 2025/1/14 22:31, Jesper Dangaard Brouer wrote: > > > On 10/01/2025 14.06, Yunsheng Lin wrote: >> This patchset fix a possible time window problem for page_pool and >> the dma API misuse problem as mentioned in [1], and try to avoid the >> overhead of the fixing using some optimization. >> >>  From the below performance data, the overhead is not so obvious >> due to performance variations for time_bench_page_pool01_fast_path() >> and time_bench_page_pool02_ptr_ring, and there is about 20ns overhead >> for time_bench_page_pool03_slow() for fixing the bug. >> > > My benchmarking on x86_64 CPUs looks significantly different. >  - CPU: Intel(R) Xeon(R) CPU E5-1650 v4 @ 3.60GHz > > Benchmark (bench_page_pool_simple) results from before and after patchset: > > | Test name  | Cycles |       |    |Nanosec |        |       |      % | > | (tasklet_*)| Before | After |diff| Before |  After |  diff | change | > |------------+--------+-------+----+--------+--------+-------+--------| > | fast_path  |     19 |    24 |   5|  5.399 |  6.928 | 1.529 |   28.3 | > | ptr_ring   |     54 |    79 |  25| 15.090 | 21.976 | 6.886 |   45.6 | > | slow       |    238 |   299 |  61| 66.134 | 83.298 |17.164 |   26.0 | > #+TBLFM: $4=$3-$2::$7=$6-$5::$8=(($7/$5)*100);%.1f > > My above testing show a clear performance regressions across three > different page_pool operating modes. I retested it on arm64 server patch by patch as the raw performance data in the attachment, it seems the result seemed similar as before. Before this patchset: fast_path ptr_ring slow 1. 31.171 ns 60.980 ns 164.917 ns 2. 28.824 ns 60.891 ns 170.241 ns 3. 14.236 ns 60.583 ns 164.355 ns With patch 1-4: 4. 31.443 ns 53.242 ns 210.148 ns 5. 31.406 ns 53.270 ns 210.189 ns With patch 1-5: 6. 26.163 ns 53.781 ns 189.450 ns 7. 26.189 ns 53.798 ns 189.466 ns With patch 1-8: 8. 28.108 ns 68.199 ns 202.516 ns 9. 16.128 ns 55.904 ns 202.711 ns I am not able to get hold of a x86 server yet, I might be able to get one during weekend. Theoretically, patch 1-4 or 1-5 should not have much performance impact for fast_path and ptr_ring except for the rcu_lock mentioned in page_pool_napi_local(), so it would be good if patch 1-5 is also tested in your testlab with the rcu_lock removing in page_pool_napi_local(). > > > Data also available in: >  - https://github.com/xdp-project/xdp-project/blob/main/areas/mem/page_pool07_bench_DMA_fix.org > > Raw data below > > Before this patchset: > > [  157.186644] bench_page_pool_simple: Loaded > [  157.475084] time_bench: Type:for_loop Per elem: 1 cycles(tsc) 0.284 ns (step:0) - (measurement period time:0.284327440 sec time_interval:284327440) - (invoke count:1000000000 tsc_interval:1023590451) > [  162.262752] time_bench: Type:atomic_inc Per elem: 17 cycles(tsc) 4.769 ns (step:0) - (measurement period time:4.769757001 sec time_interval:4769757001) - (invoke count:1000000000 tsc_interval:17171776113) > [  163.324091] time_bench: Type:lock Per elem: 37 cycles(tsc) 10.431 ns (step:0) - (measurement period time:1.043182161 sec time_interval:1043182161) - (invoke count:100000000 tsc_interval:3755514465) > [  163.341702] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path > [  163.922466] time_bench: Type:no-softirq-page_pool01 Per elem: 20 cycles(tsc) 5.713 ns (step:0) - (measurement period time:0.571357387 sec time_interval:571357387) - (invoke count:100000000 tsc_interval:2056911063) > [  163.941429] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path > [  165.506796] time_bench: Type:no-softirq-page_pool02 Per elem: 56 cycles(tsc) 15.560 ns (step:0) - (measurement period time:1.556080558 sec time_interval:1556080558) - (invoke count:100000000 tsc_interval:5601960921) > [  165.525978] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path > [  171.811289] time_bench: Type:no-softirq-page_pool03 Per elem: 225 cycles(tsc) 62.763 ns (step:0) - (measurement period time:6.276301531 sec time_interval:6276301531) - (invoke count:100000000 tsc_interval:22594974468) > [  171.830646] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path > [  171.838561] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path > [  172.387597] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 19 cycles(tsc) 5.399 ns (step:0) - (measurement period time:0.539904228 sec time_interval:539904228) - (invoke count:100000000 tsc_interval:1943679246) > [  172.407130] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path > [  173.925266] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 54 cycles(tsc) 15.090 ns (step:0) - (measurement period time:1.509075496 sec time_interval:1509075496) - (invoke count:100000000 tsc_interval:5432740575) > [  173.944878] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path > [  180.567094] time_bench: Type:tasklet_page_pool03_slow Per elem: 238 cycles(tsc) 66.134 ns (step:0) - (measurement period time:6.613430605 sec time_interval:6613430605) - (invoke count:100000000 tsc_interval:23808654870) > > > > After this patchset: > [  860.519918] bench_page_pool_simple: Loaded > [  860.781605] time_bench: Type:for_loop Per elem: 0 cycles(tsc) 0.257 ns (step:0) - (measurement period time:0.257573336 sec time_interval:257573336) - (invoke count:1000000000 tsc_interval:927275355) > [  865.613893] time_bench: Type:atomic_inc Per elem: 17 cycles(tsc) 4.814 ns (step:0) - (measurement period time:4.814593429 sec time_interval:4814593429) - (invoke count:1000000000 tsc_interval:17332768494) > [  866.708420] time_bench: Type:lock Per elem: 38 cycles(tsc) 10.763 ns (step:0) - (measurement period time:1.076362960 sec time_interval:1076362960) - (invoke count:100000000 tsc_interval:3874955595) > [  866.726118] bench_page_pool_simple: time_bench_page_pool01_fast_path(): Cannot use page_pool fast-path > [  867.423572] time_bench: Type:no-softirq-page_pool01 Per elem: 24 cycles(tsc) 6.880 ns (step:0) - (measurement period time:0.688069107 sec time_interval:688069107) - (invoke count:100000000 tsc_interval:2477080260) > [  867.442517] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): Cannot use page_pool fast-path > [  869.436286] time_bench: Type:no-softirq-page_pool02 Per elem: 71 cycles(tsc) 19.844 ns (step:0) - (measurement period time:1.984451929 sec time_interval:1984451929) - (invoke count:100000000 tsc_interval:7144120329) > [  869.455492] bench_page_pool_simple: time_bench_page_pool03_slow(): Cannot use page_pool fast-path > [  877.071437] time_bench: Type:no-softirq-page_pool03 Per elem: 273 cycles(tsc) 76.069 ns (step:0) - (measurement period time:7.606911291 sec time_interval:7606911291) - (invoke count:100000000 tsc_interval:27385252251) > [  877.090762] bench_page_pool_simple: pp_tasklet_handler(): in_serving_softirq fast-path > [  877.098683] bench_page_pool_simple: time_bench_page_pool01_fast_path(): in_serving_softirq fast-path > [  877.800696] time_bench: Type:tasklet_page_pool01_fast_path Per elem: 24 cycles(tsc) 6.928 ns (step:0) - (measurement period time:0.692852876 sec time_interval:692852876) - (invoke count:100000000 tsc_interval:2494303293) > [  877.820224] bench_page_pool_simple: time_bench_page_pool02_ptr_ring(): in_serving_softirq fast-path > [  880.026911] time_bench: Type:tasklet_page_pool02_ptr_ring Per elem: 79 cycles(tsc) 21.976 ns (step:0) - (measurement period time:2.197615122 sec time_interval:2197615122) - (invoke count:100000000 tsc_interval:7911521190) > [  880.046528] bench_page_pool_simple: time_bench_page_pool03_slow(): in_serving_softirq fast-path > [  888.385235] time_bench: Type:tasklet_page_pool03_slow Per elem: 299 cycles(tsc) 83.298 ns (step:0) - (measurement period time:8.329893717 sec time_interval:8329893717) - (invoke count:100000000 tsc_interval:29988024696) As mentioned by Toke, we may be able to reduce the performance difference between tasklet and non-tasklet testcases by removing the rcu_lock in page_pool_napi_local() for patch 1 as in_softirq() checking in page_pool_napi_local() should ensure RCU-bh read-side critical section.