From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <2f256bce-0c37-4940-9218-9545daa46169@huawei.com>
Date: Wed, 6 Nov 2024 18:56:34 +0800
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
From: Yunsheng Lin <linyunsheng@huawei.com>
To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen
Cc: Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet,
 Ilias Apalodimas, kernel-team, Christoph Hellwig
References: <20241022032214.3915232-1-linyunsheng@huawei.com>
 <20241022032214.3915232-4-linyunsheng@huawei.com>
 <113c9835-f170-46cf-92ba-df4ca5dfab3d@huawei.com>
 <878qudftsn.fsf@toke.dk> <87r084e8lc.fsf@toke.dk> <878qu7c8om.fsf@toke.dk>
 <1eac33ae-e8e1-4437-9403-57291ba4ced6@huawei.com> <87o731by64.fsf@toke.dk>
 <023fdee7-dbd4-4e78-b911-a7136ff81343@huawei.com> <874j4sb60w.fsf@toke.dk>
Content-Type: text/plain; charset="UTF-8"
+cc Christoph & Marek

On 2024/11/6 4:11, Jesper Dangaard Brouer wrote:

...

>>>
>>>> I am not sure I understand the reasoning behind the above suggestion to
>>>> 'wait and see if this actually turns out to be a problem' when we already
>>>> know that there are some cases which need cache kicking/flushing for the
>>>> waiting to work, that the kicking/flushing may not be easy and may take an
>>>> indefinite time, and that there might be other cases needing
>>>> kicking/flushing that we don't know about yet.
>>>>
>>>> Is there any reason not to consider recording the inflight pages, so that
>>>> unmapping can be done for inflight pages before the driver is unbound,
>>>> supposing a dynamic number of inflight pages can be supported?
>>>>
>>>> IOW, is there any reason you and Jesper take it as axiomatic that recording
>>>> the inflight pages is bad, supposing the number of inflight pages can be
>>>> unlimited and the recording can be done with minimal performance overhead?
>>>
>>> Well, page_pool is a memory allocator, and it already has a mechanism to
>>> handle returning of memory to it.
>>> You're proposing to add a second,
>>> orthogonal, mechanism to do this, one that adds both overhead and
>>
>> I would call it a replacement/improvement for the old one rather than 'a
>> second, orthogonal' mechanism, as the old one doesn't really exist after
>> this patch.
>>
>
> Yes, you are proposing a very radical change to the page_pool design.
> And this is getting proposed as a fix patch for IOMMU.
>
> It is a very radical change that page_pool needs to keep track of *ALL*
> in-flight pages.

I agree that it is a radical change; that is why it is targeting the net-next
tree instead of the net tree, even though there is a Fixes tag for it. If
there were a proper, non-radical way to fix this, I would prefer the
non-radical way too.

>
> The DMA issue is a life-time issue of DMA object associated with the
> struct device.  Then, why are you not looking at extending the life-time

It seems it is not really about the life-time of the DMA object being tied to
the life-time of the 'struct device'; rather, the life-time of DMA API usage
is tied to the life-time of the driver bound to the 'struct device', in the
opinion of the experts from the IOMMU/DMA subsystem in [1] & [2].

I am not sure what the reasoning behind that is, but the implementation seems
to bear it out, as mentioned in [3]:

__device_release_driver -> device_unbind_cleanup -> arch_teardown_dma_ops

1. https://lkml.org/lkml/2024/8/6/632
2. https://lore.kernel.org/all/20240923175226.GC9634@ziepe.ca/
3. https://lkml.org/lkml/2024/10/15/686

> of the DMA object, or at least detect when DMA object goes away, such
> that we can change a setting in page_pool to stop calling DMA unmap for
> the pages in-flight once they get returned (which we have an existing
> mechanism for).

To be honest, I was mostly depending on the opinion of the experts from the
IOMMU/DMA subsystem for correct DMA API usage, as mentioned above. So I am
not sure whether skipping DMA unmapping for the inflight pages is correct
DMA API usage.
If it is correct DMA API usage, how do we detect when DMA unmapping can be
skipped? From the previous discussion, skipping DMA unmapping may cause some
resource leaking, like the IOVA resource behind the IOMMU and the bounce
buffer memory behind the swiotlb.

Anyway, I may be wrong; CC'ing more experts to see if we can get some
clarification from them.

>
>
>>> complexity, yet doesn't handle all cases (cf your comment about devmem).
>>
>> I am not sure yet whether unmapping needs to be done using devmem's own
>> version of the DMA API, but it seems waiting might also need its own
>> version of kicking/flushing for devmem, as devmem might be held from user
>> space.
>>
>>>
>>> And even if it did handle all cases, force-releasing pages in this way
>>> really feels like it's just papering over the issue. If there are pages
>>> being leaked (or that are outstanding forever, which basically amounts
>>> to the same thing), that is something we should be fixing the root cause
>>> of, not just working around it like this series does.
>>
>> If there were a definite time limit for the waiting, I would probably
>> agree with the above. From the previous discussion, it seems the time
>> needed for the kicking/flushing would be indefinite, depending on how much
>> cache has to be scanned/flushed.
>>
>> For the 'papering over' part, the question seems to be whether we paper
>> over the different kicking/flushing schemes or paper over the unmapping
>> using the different DMA APIs.
>>
>> Also, page_pool is not really an allocator; it is more like a pool on top
>> of a different allocator, such as the buddy allocator or the devmem
>> allocator. I am not sure it makes much sense to do the flushing when
>> page_pool_destroy() is called if the buddy allocator behind the page_pool
>> is not under memory pressure yet.
>>
>
> I still see page_pool as an allocator like the SLUB/SLAB allocators,
> where slab caches are created (and can be destroyed again), from which we
> can allocate slab objects.
> SLAB allocators also use the buddy
> allocator as their backing allocator.

I am not sure SLUB/SLAB is that similar to page_pool for the specific problem
here; at least SLUB/SLAB doesn't support DMA mapping in its core, and it
doesn't seem to allow inflight objects when kmem_cache_destroy() is called,
as its alloc API doesn't take a reference on s->refcount and doesn't do the
inflight accounting that page_pool does:

https://elixir.bootlin.com/linux/v6.12-rc6/source/mm/slab_common.c#L512

>
> The page_pool is of course evolving with the addition of the devmem
> allocator as a different "backing" allocator type.