Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Thu, 7 Nov 2024 19:09:52 +0800
Message-ID: <30ab6359-2ad6-4be0-bf73-59ae454811a9@huawei.com>
To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen
Cc: Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Ilias Apalodimas, kernel-team, Viktor Malik
From: Yunsheng Lin <linyunsheng@huawei.com>

On 2024/11/6 23:57, Jesper Dangaard Brouer wrote:

...

>>
>> Some more info from production servers.
>>
>> (I'm amazed at what we can do with a simple bpftrace script, Cc Viktor)
>>
>> In the bpftrace script/oneliner below I'm extracting the inflight count for
>> all page_pool's in the system and storing that in a histogram hash.
>>
>> sudo bpftrace -e '
>>   rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>>    @cnt_total[probe]=count();
>>    $pool=(struct page_pool*)arg0;
>>    $release_cnt=(uint32)arg2;
>>    $hold_cnt=$pool->pages_state_hold_cnt;
>>    $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>>    @inflight=hist($inflight_cnt);
>>   }
>>   interval:s:1 {time("\n%H:%M:%S\n");
>>    print(@cnt); clear(@cnt);
>>    print(@inflight);
>>    print(@cnt_total);
>>   }'
>>
>> The page_pool behavior depends on how the NIC driver uses it, so I've run
>> this on two prod servers with bnxt and mlx5 drivers, on a 6.6.51 kernel.
>>
>> Driver: bnxt_en
>> - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 8447
>> @inflight:
>> [0]             507 |                                        |
>> [1]             275 |                                        |
>> [2, 4)          261 |                                        |
>> [4, 8)          215 |                                        |
>> [8, 16)         259 |                                        |
>> [16, 32)        361 |                                        |
>> [32, 64)        933 |                                        |
>> [64, 128)      1966 |                                        |
>> [128, 256)   937052 |@@@@@@@@@                               |
>> [256, 512)  5178744 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [512, 1K)     73908 |                                        |
>> [1K, 2K)    1220128 |@@@@@@@@@@@@                            |
>> [2K, 4K)    1532724 |@@@@@@@@@@@@@@@                         |
>> [4K, 8K)    1849062 |@@@@@@@@@@@@@@@@@@                      |
>> [8K, 16K)   1466424 |@@@@@@@@@@@@@@                          |
>> [16K, 32K)   858585 |@@@@@@@@                                |
>> [32K, 64K)   693893 |@@@@@@                                  |
>> [64K, 128K)  170625 |@                                       |
>>
>> Driver: mlx5_core
>> - Kernel: 6.6.51
>>
>> @cnt[rawtracepoint:page_pool_state_release]: 1975
>> @inflight:
>> [128, 256)         28293 |@@@@                               |
>> [256, 512)        184312 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
>> [512, 1K)              0 |                                   |
>> [1K, 2K)            4671 |                                   |
>> [2K, 4K)          342571 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)          180520 |@@@@@@@@@@@@@@@@@@@@@@@@@@@        |
>> [8K, 16K)          96483 |@@@@@@@@@@@@@@                     |
>> [16K, 32K)         25133 |@@@                                |
>> [32K, 64K)          8274 |@                                  |
>>
>>
>> The key thing to notice is that we have up to 128,000 pages in flight on
>> these random production servers. The NIC has 64 RX queues configured, and
>> thus also 64 page_pool objects.
>>
>
> I realized that we primarily want to know the maximum number of in-flight pages.
>
> So, I modified the bpftrace oneliner to track the max for each page_pool in the system.
>
> sudo bpftrace -e '
>  rawtracepoint:page_pool_state_release { @cnt[probe]=count();
>   @cnt_total[probe]=count();
>   $pool=(struct page_pool*)arg0;
>   $release_cnt=(uint32)arg2;
>   $hold_cnt=$pool->pages_state_hold_cnt;
>   $inflight_cnt=(int32)($hold_cnt - $release_cnt);
>   $cur=@inflight_max[$pool];
>   if ($inflight_cnt > $cur) {
>     @inflight_max[$pool]=$inflight_cnt;}
>  }
>  interval:s:1 {time("\n%H:%M:%S\n");
>   print(@cnt); clear(@cnt);
>   print(@inflight_max);
>   print(@cnt_total);
>  }'
>
> I've attached the output from the script.
> For an unknown reason this system had 199 page_pool objects.

Perhaps some of those page_pool objects are per_cpu page_pool objects from
net_page_pool_create()? It would be good if the pool_size for those
page_pool objects were printed too.
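Something like the rough, untested bpftrace sketch below might work for
dumping pool_size per page_pool; it assumes pool_size is reachable through
the embedded page_pool_params at pool->p on the 6.6 kernel (field names not
verified against your exact build):

sudo bpftrace -e '
 rawtracepoint:page_pool_state_release {
  $pool=(struct page_pool*)arg0;
  // assumption: pool_size lives in the embedded struct page_pool_params at pool->p
  @pool_size[$pool]=$pool->p.pool_size;
 }
 interval:s:10 { print(@pool_size); }'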
>
> The 20 top users:
>
> $ cat out02.inflight-max | grep inflight_max | tail -n 20
> @inflight_max[0xffff88829133d800]: 26473
> @inflight_max[0xffff888293c3e000]: 27042
> @inflight_max[0xffff888293c3b000]: 27709
> @inflight_max[0xffff8881076f2800]: 29400
> @inflight_max[0xffff88818386e000]: 29690
> @inflight_max[0xffff8882190b1800]: 29813
> @inflight_max[0xffff88819ee83800]: 30067
> @inflight_max[0xffff8881076f4800]: 30086
> @inflight_max[0xffff88818386b000]: 31116
> @inflight_max[0xffff88816598f800]: 36970
> @inflight_max[0xffff8882190b7800]: 37336
> @inflight_max[0xffff888293c38800]: 39265
> @inflight_max[0xffff888293c3c800]: 39632
> @inflight_max[0xffff888293c3b800]: 43461
> @inflight_max[0xffff888293c3f000]: 43787
> @inflight_max[0xffff88816598f000]: 44557
> @inflight_max[0xffff888132ce9000]: 45037
> @inflight_max[0xffff888293c3f800]: 51843
> @inflight_max[0xffff888183869800]: 62612
> @inflight_max[0xffff888113d08000]: 73203
>
> Adding all values together:
>
>  grep inflight_max out02.inflight-max | awk 'BEGIN {tot=0} {tot+=$2; printf "total:" tot "\n"}' | tail -n 1
>
> total:1707129
>
> Worst case we need a data structure holding 1,707,129 pages.

For a 64-bit system, that means about 54MB of memory overhead for tracking
those inflight pages if 16 bytes of metadata are needed for each page; I
guess that is OK for those large systems.

> Fortunately, we don't need a single data structure as this will be split
> between 199 page_pool's.

It would be good to have an average value for the number of inflight pages,
so that we might be able to use statically allocated memory to satisfy the
common case, and fall back to dynamically allocated memory if/when
necessary (a rough bpftrace sketch for measuring that average follows at
the end of this mail).

> --Jesper
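As for measuring that average: the same tracepoint and counter arithmetic
from your script could feed bpftrace's avg() per pool. Below is an untested
sketch along those lines, to be read as an illustration rather than a
polished oneliner:

sudo bpftrace -e '
 rawtracepoint:page_pool_state_release {
  $pool=(struct page_pool*)arg0;
  $release_cnt=(uint32)arg2;
  $hold_cnt=$pool->pages_state_hold_cnt;
  $inflight_cnt=(int32)($hold_cnt - $release_cnt);
  // avg() keeps a per-pool running mean of the sampled inflight counts
  @inflight_avg[$pool]=avg($inflight_cnt);
 }
 interval:s:1 { print(@inflight_avg); }'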