From: Yunsheng Lin <linyunsheng@huawei.com>
To: "Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: <zhangkun09@huawei.com>, <fanghaiqing@huawei.com>,
	<liuyonglong@huawei.com>, Robin Murphy <robin.murphy@arm.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	IOMMU <iommu@lists.linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Eric Dumazet <edumazet@google.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	<linux-mm@kvack.org>, <linux-kernel@vger.kernel.org>,
	<netdev@vger.kernel.org>,
	kernel-team <kernel-team@cloudflare.com>,
	Viktor Malik <vmalik@redhat.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Thu, 7 Nov 2024 19:09:52 +0800
Message-ID: <30ab6359-2ad6-4be0-bf73-59ae454811a9@huawei.com>
In-Reply-To: <b8b7818a-e44b-45f5-91c2-d5eceaa5dd5b@kernel.org>

On 2024/11/6 23:57, Jesper Dangaard Brouer wrote:

...

>>
>> Some more info from production servers.
>>
>> (I'm amazed what we can do with a simple bpftrace script, Cc Viktor)
>>
>> In the bpftrace script/oneliner below, I'm extracting the inflight count
>> for all page_pools in the system and storing it in a histogram hash.
>>
>> sudo bpftrace -e '
>>   rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>>    @cnt_total[probe]=count();
>>    $pool=(struct page_pool*)arg0;
>>    $release_cnt=(uint32)arg2;
>>    $hold_cnt=$pool->pages_state_hold_cnt;
>>    $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>>    @inflight=hist($inflight_cnt);
>>   }
>>   interval:s:1 {time("\n%H:%M:%S\n");
>>    print(@cnt); clear(@cnt);
>>    print(@inflight);
>>    print(@cnt_total);
>>   }'
>>
>> The page_pool behavior depends on how the NIC driver uses it, so I've run
>> this on two prod servers with the bnxt and mlx5 drivers, on a 6.6.51 kernel.
>>
>> Driver: bnxt_en
>>   - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 8447
>> @inflight:
>> [0]             507 |                                        |
>> [1]             275 |                                        |
>> [2, 4)          261 |                                        |
>> [4, 8)          215 |                                        |
>> [8, 16)         259 |                                        |
>> [16, 32)        361 |                                        |
>> [32, 64)        933 |                                        |
>> [64, 128)      1966 |                                        |
>> [128, 256)   937052 |@@@@@@@@@                               |
>> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)     73908 |                                        |
>> [1K, 2K)    1220128 |@@@@@@@@@@@@                            |
>> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
>> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
>> [8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
>> [16K, 32K)   858585 |@@@@@@@@                                |
>> [32K, 64K)   693893 |@@@@@@                                  |
>> [64K, 128K)  170625 |@                                       |
>>
>> Driver: mlx5_core
>>   - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 1975
>> @inflight:
>> [128, 256)         28293 |@@@@                               |
>> [256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
>> [512, 1K)              0 |                                   |
>> [1K, 2K)            4671 |                                   |
>> [2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
>> [8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
>> [16K, 32K)         25133 |@@@                                |
>> [32K, 64K)          8274 |@                                  |
>>
>>
>> The key thing to notice is that we have up to 128,000 pages in flight on
>> these random production servers. The NIC has 64 RX queues configured,
>> thus also 64 page_pool objects.
>>
> 
> I realized that we primarily want to know the maximum in-flight pages.
> 
> So, I modified the bpftrace oneliner to track the max for each page_pool in the system.
> 
> sudo bpftrace -e '
>  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>   @cnt_total[probe]=count();
>   $pool=(struct page_pool*)arg0;
>   $release_cnt=(uint32)arg2;
>   $hold_cnt=$pool->pages_state_hold_cnt;
>   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>   $cur=@inflight_max[$pool];
>   if ($inflight_cnt > $cur) {
>     @inflight_max[$pool]=$inflight_cnt;}
>  }
>  interval:s:1 {time("\n%H:%M:%S\n");
>   print(@cnt); clear(@cnt);
>   print(@inflight_max);
>   print(@cnt_total);
>  }'
> 
> I've attached the output from the script.
> For an unknown reason, this system had 199 page_pool objects.

Perhaps some of those page_pool objects are per-CPU page_pool
objects from net_page_pool_create()?

It would be good if the pool_size for those page_pool objects
were printed too.

> 
> The 20 top users:
> 
> $ cat out02.inflight-max | grep inflight_max | tail -n 20
> @inflight_max[0xffff88829133d800]: 26473
> @inflight_max[0xffff888293c3e000]: 27042
> @inflight_max[0xffff888293c3b000]: 27709
> @inflight_max[0xffff8881076f2800]: 29400
> @inflight_max[0xffff88818386e000]: 29690
> @inflight_max[0xffff8882190b1800]: 29813
> @inflight_max[0xffff88819ee83800]: 30067
> @inflight_max[0xffff8881076f4800]: 30086
> @inflight_max[0xffff88818386b000]: 31116
> @inflight_max[0xffff88816598f800]: 36970
> @inflight_max[0xffff8882190b7800]: 37336
> @inflight_max[0xffff888293c38800]: 39265
> @inflight_max[0xffff888293c3c800]: 39632
> @inflight_max[0xffff888293c3b800]: 43461
> @inflight_max[0xffff888293c3f000]: 43787
> @inflight_max[0xffff88816598f000]: 44557
> @inflight_max[0xffff888132ce9000]: 45037
> @inflight_max[0xffff888293c3f800]: 51843
> @inflight_max[0xffff888183869800]: 62612
> @inflight_max[0xffff888113d08000]: 73203
> 
> Adding all values together:
> 
>  grep inflight_max out02.inflight-max | awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1
> 
> total:1707129
> 
> Worst case we need a data structure holding 1,707,129 pages.

On a 64-bit system, that means about 27MB of memory overhead for tracking
those inflight pages if 16 bytes of metadata are needed for each page (about
54MB if it is 32 bytes per page). I guess that is OK for those large systems.
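
A rough sketch of that arithmetic, in case it helps when comparing candidate
tracking structures. The function name and the two per-page metadata sizes
(16 and 32 bytes) are only assumptions for illustration; the real size
depends on whatever structure ends up tracking the inflight pages.

```shell
# Estimate tracking overhead as: pages * bytes of metadata per page.
# Both arguments are plain integers; the per-page size is an assumption.
overhead_bytes() {
    pages=$1
    bytes_per_page=$2
    echo $(( pages * bytes_per_page ))
}

# Worst-case page count taken from the quoted total above.
pages=1707129
for bpp in 16 32; do
    bytes=$(overhead_bytes "$pages" "$bpp")
    # Integer division, so the MB figure is rounded down.
    echo "$bpp bytes/page: $bytes bytes (~$(( bytes / 1000000 )) MB)"
done
```

With the quoted total, 16 bytes per page comes to roughly 27MB and 32 bytes
per page to roughly 54MB.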

> Fortunately, we don't need a single data structure as this will be split
> between 199 page_pools.

It would be good to have an average value for the number of inflight pages,
so that we might be able to use statically allocated memory to cover the
most common case, and fall back to dynamically allocated memory if/when
necessary.
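
As a starting point, the per-pool maxima already collected could be
summarized with something like the sketch below. It assumes the capture file
holds a single snapshot with one "@inflight_max[<pool>]: <count>" line per
pool, and the function name is made up for illustration. Note that it
averages the per-pool maxima, which is not the same as a time average of the
inflight count, so it only gives an upper-bound flavor of the "typical" case.

```shell
# Summarize "@inflight_max[<pool>]: <count>" lines read from stdin.
# Prints the number of pools, the sum, the integer mean and the largest value.
inflight_summary() {
    awk '/inflight_max\[/ {
             n++; tot += $2
             if ($2 > max) max = $2
         }
         END {
             if (n > 0)
                 printf "pools:%d total:%d avg:%d max:%d\n", n, tot, tot / n, max
         }'
}

# Intended usage with the capture file from the quoted message:
#   inflight_summary < out02.inflight-max
# Small self-contained run using three of the values quoted above:
printf '%s\n' \
    '@inflight_max[0xffff88829133d800]: 26473' \
    '@inflight_max[0xffff888183869800]: 62612' \
    '@inflight_max[0xffff888113d08000]: 73203' | inflight_summary
```

For the quoted numbers (1,707,129 pages across 199 pools), the mean of the
per-pool maxima works out to roughly 8,578 pages per pool.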

> 
> --Jesper

