From: Jesper Dangaard Brouer <hawk@kernel.org>
To: "Yunsheng Lin" <linyunsheng@huawei.com>,
"Toke Høiland-Jørgensen" <toke@redhat.com>,
davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, fanghaiqing@huawei.com,
liuyonglong@huawei.com, Robin Murphy <robin.murphy@arm.com>,
Alexander Duyck <alexander.duyck@gmail.com>,
IOMMU <iommu@lists.linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Eric Dumazet <edumazet@google.com>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, kernel-team <kernel-team@cloudflare.com>,
Viktor Malik <vmalik@redhat.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Wed, 6 Nov 2024 14:25:00 +0100
Message-ID: <18ba4489-ad30-423e-9c54-d4025f74c193@kernel.org>
In-Reply-To: <204272e7-82c3-4437-bb0d-2c3237275d1f@huawei.com>
On 26/10/2024 09.33, Yunsheng Lin wrote:
> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
>
> ...
>
>>
>>>> You and Jesper seem to be suggesting that there might be 'hundreds of
>>>> gigs of memory' needed for inflight pages. It would be nice to provide
>>>> more info or reasoning about why 'hundreds of gigs of memory' is needed
>>>> here, so that we don't over-design support for recording unlimited
>>>> in-flight pages if stalling on driver unbind turns out to be impossible
>>>> and the inflight pages do need to be recorded.
>>>
>>> I don't have a concrete example of a use case that will blow the limit you
>>> are setting (but maybe Jesper does); I am simply objecting to the
>>> arbitrary imposition of any limit at all. It smells a lot like "640K ought
>>> to be enough for anyone".
>>>
>>
>> As I wrote before: in *production* I'm seeing TCP memory reach 24 GiB
>> (on machines with 384 GiB memory). I have attached a Grafana screenshot
>> to prove what I'm saying.
>>
>> As my co-worker Mike Freemon has explained to me (with more details in
>> the blogpost[1]), it is no coincidence that the graph has a "ceiling"
>> close to 24 GiB (on machines with 384 GiB total memory). This is because
>> the TCP network stack goes into a memory "under pressure" state when 6.25%
>> of total memory is used by the TCP stack. (Detail: the system will stay in
>> that mode until allocated TCP memory falls below 4.68% of total memory.)
>>
>> [1] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>
> Thanks for the info.
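To put numbers on those percentages (my own back-of-the-envelope, assuming
384 GiB of total memory): 6.25% of 384 GiB = 24 GiB, which matches the
"ceiling" in the graph, and the stack only leaves the "under pressure" state
again once TCP memory drops below 4.68% of 384 GiB, roughly 18 GiB.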
Some more info from production servers.
(I'm amazed at what we can do with a simple bpftrace script; Cc Viktor.)

In the bpftrace script/one-liner below I'm extracting the inflight count
for all page_pool instances in the system and storing it in a histogram.
sudo bpftrace -e '
rawtracepoint:page_pool_state_release {
    // Count release events, per interval and in total
    @cnt[probe] = count();
    @cnt_total[probe] = count();
    $pool = (struct page_pool *)arg0;
    $release_cnt = (uint32)arg2;
    $hold_cnt = $pool->pages_state_hold_cnt;
    // Wrap-safe distance between the two free-running u32 counters
    $inflight_cnt = (int32)($hold_cnt - $release_cnt);
    @inflight = hist($inflight_cnt);
}
interval:s:1 {
    time("\n%H:%M:%S\n");
    print(@cnt); clear(@cnt);
    print(@inflight);
    print(@cnt_total);
}'
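As a side note, the $inflight_cnt line relies on the hold/release counters
being free-running u32s. Here is a minimal userspace sketch (my illustration,
not the kernel code) of why casting the u32 difference to a signed 32-bit
value keeps the distance correct even across counter wraparound:

/* Minimal userspace sketch (illustration only, not kernel code) of the
 * wrap-safe counter distance used above. */
#include <stdio.h>
#include <stdint.h>

static int32_t inflight(uint32_t hold_cnt, uint32_t release_cnt)
{
	/* u32 subtraction wraps modulo 2^32; casting to s32 recovers the
	 * signed distance as long as it fits in 31 bits. */
	return (int32_t)(hold_cnt - release_cnt);
}

int main(void)
{
	printf("%d\n", inflight(1000, 800));             /* 200 */
	printf("%d\n", inflight(100, UINT32_MAX - 99));  /* 200, across wrap */
	return 0;
}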
The page_pool behavior depends on how the NIC driver uses it, so I've run
this on two prod servers, with the bnxt and mlx5 drivers, on a 6.6.51 kernel.
Driver: bnxt_en
- Kernel: 6.6.51
@cnt[rawtracepoint:page_pool_state_release]: 8447
@inflight:
[0] 507 | |
[1] 275 | |
[2, 4) 261 | |
[4, 8) 215 | |
[8, 16) 259 | |
[16, 32) 361 | |
[32, 64) 933 | |
[64, 128) 1966 | |
[128, 256) 937052 |@@@@@@@@@ |
[256, 512) 5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 73908 | |
[1K, 2K) 1220128 |@@@@@@@@@@@@ |
[2K, 4K) 1532724 |@@@@@@@@@@@@@@@ |
[4K, 8K) 1849062 |@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 1466424 |@@@@@@@@@@@@@@ |
[16K, 32K) 858585 |@@@@@@@@ |
[32K, 64K) 693893 |@@@@@@ |
[64K, 128K) 170625 |@ |
Driver: mlx5_core
- Kernel: 6.6.51
@cnt[rawtracepoint:page_pool_state_release]: 1975
@inflight:
[128, 256) 28293 |@@@@ |
[256, 512) 184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[512, 1K) 0 | |
[1K, 2K) 4671 | |
[2K, 4K) 342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K) 180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[8K, 16K) 96483 |@@@@@@@@@@@@@@ |
[16K, 32K) 25133 |@@@ |
[32K, 64K) 8274 |@ |
The key thing to notice is that we have up to 128,000 pages in flight on
these random production servers. The NICs have 64 RX queues configured,
and thus also 64 page_pool objects.
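For perspective (assuming 4 KiB pages), 128K in-flight pages is on the order
of 512 MiB for a single page_pool, so with 64 pools per NIC the memory
potentially tied up in in-flight pages adds up quickly.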
--Jesper