From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <18ba4489-ad30-423e-9c54-d4025f74c193@kernel.org>
Date: Wed, 6 Nov 2024 14:25:00 +0100
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
To: Yunsheng Lin, Toke Høiland-Jørgensen, davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com,
 fanghaiqing@huawei.com, liuyonglong@huawei.com, Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Ilias Apalodimas, linux-mm@kvack.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kernel-team, Viktor Malik
References: <20241022032214.3915232-1-linyunsheng@huawei.com> <20241022032214.3915232-4-linyunsheng@huawei.com> <113c9835-f170-46cf-92ba-df4ca5dfab3d@huawei.com> <878qudftsn.fsf@toke.dk> <87r084e8lc.fsf@toke.dk> <0c146fb8-4c95-4832-941f-dfc3a465cf91@kernel.org> <204272e7-82c3-4437-bb0d-2c3237275d1f@huawei.com>
From: Jesper Dangaard Brouer
In-Reply-To: <204272e7-82c3-4437-bb0d-2c3237275d1f@huawei.com>
On 26/10/2024 09.33, Yunsheng Lin wrote:
> On 2024/10/25 22:07, Jesper Dangaard Brouer wrote:
>
> ...
>
>>
>>>> You and Jesper seem to be suggesting that there might be 'hundreds
>>>> of gigs of memory' needed for inflight pages. It would be nice to
>>>> provide more info or reasoning about why 'hundreds of gigs of
>>>> memory' is needed here, so that we don't build an over-designed
>>>> thing to support recording unlimited in-flight pages, if stalling
>>>> the driver unbound turns out to be impossible and the inflight
>>>> pages do need to be recorded.
>>>
>>> I don't have a concrete example of a use that will blow the limit you
>>> are setting (but maybe Jesper does), I am simply objecting to the
>>> arbitrary imposing of any limit at all. It smells a lot of "640k ought
>>> to be enough for anyone".
>>>
>>
>> As I wrote before: in *production* I'm seeing TCP memory reach 24 GiB
>> (on machines with 384 GiB memory). I have attached a grafana
>> screenshot to prove what I'm saying.
>>
>> As my co-worker Mike Freemon has explained to me (more details in the
>> blog post[1]), it is no coincidence that the graph has a strange
>> "ceiling" close to 24 GiB (on machines with 384 GiB total memory).
>> This is because the TCP network stack goes into a memory "under
>> pressure" state when 6.25% of total memory is used by the TCP stack.
>> (Detail: The system will stay in that mode until allocated TCP
>> memory falls below 4.68% of total memory.)
>>
>>  [1] https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>
> Thanks for the info.

Some more info from production servers. (I'm amazed at what we can do
with a simple bpftrace script, Cc Viktor.)

In the bpftrace script/one-liner below I'm extracting the inflight
count, for all page_pool's in the system, and storing that in a
histogram hash.

sudo bpftrace -e '
 rawtracepoint:page_pool_state_release {
   @cnt[probe] = count();
   @cnt_total[probe] = count();
   $pool = (struct page_pool *)arg0;
   $release_cnt = (uint32)arg2;
   $hold_cnt = $pool->pages_state_hold_cnt;
   $inflight_cnt = (int32)($hold_cnt - $release_cnt);
   @inflight = hist($inflight_cnt);
 }
 interval:s:1 {
   time("\n%H:%M:%S\n");
   print(@cnt); clear(@cnt);
   print(@inflight);
   print(@cnt_total);
 }'

The page_pool behavior depends on how the NIC driver uses it, so I've
run this on two prod servers with the bnxt and mlx5 drivers, on a
6.6.51 kernel.
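(Aside, for anyone reproducing this: the $inflight_cnt computation
relies on wrapping arithmetic, since both counters are 32-bit and
pages_state_hold_cnt can overflow. A small Python sketch, mine and not
kernel code, of why the (int32) cast on the hold/release difference
still gives the right in-flight count across u32 wraparound:)

```python
def inflight(hold_cnt: int, release_cnt: int) -> int:
    """Mirror the bpftrace computation: subtract two u32 counters and
    reinterpret the result as a signed 32-bit value, which stays
    correct across counter wraparound as long as the true in-flight
    count fits in an int32."""
    diff = (hold_cnt - release_cnt) & 0xFFFFFFFF  # u32 subtraction
    # reinterpret as signed 32-bit (the (int32) cast in the script)
    return diff - (1 << 32) if diff >= (1 << 31) else diff

# hold counter has wrapped past 2^32, release counter has not yet:
print(inflight(5, 0xFFFFFFFB))  # -> 10 pages still in flight
```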
Driver: bnxt_en - kernel 6.6.51

@cnt[rawtracepoint:page_pool_state_release]: 8447
@inflight:
[0]                507 |                                        |
[1]                275 |                                        |
[2, 4)             261 |                                        |
[4, 8)             215 |                                        |
[8, 16)            259 |                                        |
[16, 32)           361 |                                        |
[32, 64)           933 |                                        |
[64, 128)         1966 |                                        |
[128, 256)      937052 |@@@@@@@@@                               |
[256, 512)     5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)        73908 |                                        |
[1K, 2K)       1220128 |@@@@@@@@@@@@                            |
[2K, 4K)       1532724 |@@@@@@@@@@@@@@@                         |
[4K, 8K)       1849062 |@@@@@@@@@@@@@@@@@@                      |
[8K, 16K)      1466424 |@@@@@@@@@@@@@@                          |
[16K, 32K)      858585 |@@@@@@@@                                |
[32K, 64K)      693893 |@@@@@@                                  |
[64K, 128K)     170625 |@                                       |

Driver: mlx5_core - kernel 6.6.51

@cnt[rawtracepoint:page_pool_state_release]: 1975
@inflight:
[128, 256)       28293 |@@@@                                    |
[256, 512)      184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[512, 1K)            0 |                                        |
[1K, 2K)          4671 |                                        |
[2K, 4K)        342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@     |
[4K, 8K)        180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@             |
[8K, 16K)        96483 |@@@@@@@@@@@@@@                          |
[16K, 32K)       25133 |@@@                                     |
[32K, 64K)        8274 |@                                       |

The key thing to notice is that we have up to 128K pages in flight on
these (random) production servers. The NICs have 64 RX queues
configured, thus also 64 page_pool objects.

--Jesper
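To put those in-flight counts into memory terms, here is my own
back-of-envelope calculation (assuming order-0 4 KiB pages and, as a
worst case, every pool sitting at the top histogram bucket; these
assumptions are mine, not measured):

```python
# Rough memory pinned by in-flight page_pool pages.
PAGE_SIZE = 4096           # assumption: order-0 pages, 4 KiB
INFLIGHT_PAGES = 128_000   # top histogram bucket observed above
POOLS = 64                 # one page_pool per RX queue

per_pool = INFLIGHT_PAGES * PAGE_SIZE
print(f"single pool: ~{per_pool / 2**20:.0f} MiB")        # ~500 MiB
print(f"all {POOLS} pools: ~{POOLS * per_pool / 2**30:.1f} GiB")
```

So a single pool at that level pins on the order of half a GiB, and
all 64 pools together roughly 31 GiB, which is why the limit
discussion above matters.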