From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5BA9BD44D75 for ; Wed, 6 Nov 2024 14:17:41 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id C5E108D000F; Wed, 6 Nov 2024 09:17:40 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id C0F618D0001; Wed, 6 Nov 2024 09:17:40 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id AAE0B8D000F; Wed, 6 Nov 2024 09:17:40 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com [216.40.44.15]) by kanga.kvack.org (Postfix) with ESMTP id 895E48D0001 for ; Wed, 6 Nov 2024 09:17:40 -0500 (EST) Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay03.hostedemail.com (Postfix) with ESMTP id 0A6C7A030C for ; Wed, 6 Nov 2024 14:17:40 +0000 (UTC) X-FDA: 82755872472.22.1BD9540 Received: from foss.arm.com (foss.arm.com [217.140.110.172]) by imf01.hostedemail.com (Postfix) with ESMTP id 0E10D40025 for ; Wed, 6 Nov 2024 14:17:10 +0000 (UTC) Authentication-Results: imf01.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf01.hostedemail.com: domain of robin.murphy@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=robin.murphy@arm.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1730902521; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=f+I+u3o0hGARX96/7gnI1nr8GmU1WHOUT/aqoDpxXak=; b=A1htKqpsDC98FZRV2pgf+oCfYS/Adc1ledUXugSoLGflr03RBJiBZl+rBrTrOE7PmWiePl HNeLWGQhkefEkLWCLqQ88uv+UudK6du3D2qx+Hig+MT6Jo7Z0DnwqqTBoWJslU86hLS8C/ 9HU9OSUr4sA+9Fu04/hN2sYt7OyvsCg= ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1730902521; a=rsa-sha256; cv=none; b=h/DzA6MY4vSU+ACLGzx76FkZrLUt7xX/osSf5JM72wcSo4kd8/nWrda+L9uTkv+O0hgew0 tSHoSMldCqUBJcshtGczXEjYtx3M5Ol7qAfdQm+aYbbPn7xs1nms23UhrY0R0winj3LrE1 y2EpLw0fo4+5GuBrECPuYHfxfLi0uNI= ARC-Authentication-Results: i=1; imf01.hostedemail.com; dkim=none; dmarc=pass (policy=none) header.from=arm.com; spf=pass (imf01.hostedemail.com: domain of robin.murphy@arm.com designates 217.140.110.172 as permitted sender) smtp.mailfrom=robin.murphy@arm.com Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id 9B508497; Wed, 6 Nov 2024 06:18:06 -0800 (PST) Received: from [10.57.90.5] (unknown [10.57.90.5]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 875783F66E; Wed, 6 Nov 2024 06:17:33 -0800 (PST) Message-ID: Date: Wed, 6 Nov 2024 14:17:31 +0000 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound To: Yunsheng Lin , Jesper Dangaard Brouer , =?UTF-8?Q?Toke_H=C3=B8iland-J=C3=B8rgensen?= , davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com Cc: zhangkun09@huawei.com, fanghaiqing@huawei.com, liuyonglong@huawei.com, Alexander Duyck , IOMMU , Andrew Morton , Eric Dumazet , Ilias Apalodimas , linux-mm@kvack.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kernel-team , Christoph Hellwig , m.szyprowski@samsung.com References: <20241022032214.3915232-1-linyunsheng@huawei.com> <20241022032214.3915232-4-linyunsheng@huawei.com> <113c9835-f170-46cf-92ba-df4ca5dfab3d@huawei.com> <878qudftsn.fsf@toke.dk> <87r084e8lc.fsf@toke.dk> <878qu7c8om.fsf@toke.dk> <1eac33ae-e8e1-4437-9403-57291ba4ced6@huawei.com> <87o731by64.fsf@toke.dk> <023fdee7-dbd4-4e78-b911-a7136ff81343@huawei.com> <874j4sb60w.fsf@toke.dk> <2f256bce-0c37-4940-9218-9545daa46169@huawei.com> From: Robin Murphy Content-Language: en-GB In-Reply-To: <2f256bce-0c37-4940-9218-9545daa46169@huawei.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 0E10D40025 X-Stat-Signature: zqf1zgu16c9nkfuunc3x8cfis7eeeuet X-HE-Tag: 1730902630-163278 X-HE-Meta: U2FsdGVkX1+LFFGFl/atlh5rhh7vMT5Ab9Nywb5hSDgo5rsALO2NnRhKuoTPHAtCugkJlid24UT4WaKJvSe5cc7c/mpxx7IXbd/dZnq8KIObSklIlm0c1oC4HYCWrTA6JutIP5yO97hqO5rCgZfhkbEsPquq9rxygvhHT0VmJe1RUStSrKM3l029NePy3SLe2m0NPjm5Ys9iitcj4YD+aAi7lKCBs5BPFkOSB7SAb5GTEzQaRVUe/7IK4evkS7T/iK2O5UWqVQTpSnuYHzrv8hegRuphKpUDNhGB/3nHbDSCygAP+XAwOhlbwHHMvoV5vtaTXX2KbUVtSs/aj81BBjzDxJ8V+n0D+2eIikxOGce8cyf7ya52SlAe7wKWgSAdrGx23WirZytKb63RnPj+UPOg4pk0kIykMX1B5cOYxvdf+ekAVl20ZyEsy9mR17WZH3SSxuOfnvJBXRUDJYA9eF/jCxSegl6SQYEs7pWNtUfGbibElr7Xzf5v42CwYNbimshEREq6z6v38Ci49wbjmb6yqb5KwEJIsN6ImRkLrZ2ZK2vPJRj7/nFQ2FURIRbKdsMoGU7fXXHbOKw+4JUi0olpqKk/neMYzLoJa6ZvMc7xvv5JJoeKhoeW4QUWrgni/AtscZlaSkb6D8njcPRnvD0vRAHHKrMZlO2UB9bJrB4ylghYr6shfx3kLE/K/5XcMUtRekHAK5A754IAB/dRhOEBE5VBkjt9gpRDV4YQBFtG0kjxslwIh+DSiI3LpAhNqeHRr9egILdYjyRDcWZK/xne6PgsFni4VypP6NFfCXPvqiBdKCy+B4nEKcJsWIYMBTe87Xai5yelZXTxWLANYmJceHgbzCC7Li8/e3WG5Nb4bbZS4N289JPqkcmCIzx84ZDFo0+joVqxMiqD1geknNZuJuv5XfXMIFuKuMWfLP6UYLVexK+C2uIuMv8eLad2zloXDY9Bme9XIzoeHXm MOE8Zh0Y 0Jr75CBq3jVspHW9KXWHMCLLLU4TKmvRIm+7Qd/g5EbLQNeX89Wm3SAwiAReZnbISymGmyn15BgGSgy4nIb+ayl7yxpaVsV+QTP5Yr1HYrRwrEezou40WMt/4ZoXaTDagvgnmRI3kZZ2YuvOV5Xeb0k0PeuRzHFkAuO0Z4aC/VnDct6nouF8KLCSYLg/mnlxkkUVEk2FTStodgkg8Uf0Ai6WLQ5Ll83f65yiiZvKv39LzhNAmcU6Y/jabKcf4nz5UH/EQeL9hqjvIMUmV84kuJgY/xObAKHPqJXok6s4fFYJpNQLkj6hQ1ujvpepscdPB8pprhaNZmNKYKZdVwlGmgehRi88I5RdLp0KAr7AfApTprGWz5H400o6k53gRnGC4TCcUQX67kJGNqTSIakcPa7eJEDezLxEB49N8zWl6zUe28B4= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024-11-06 10:56 am, Yunsheng Lin wrote: > +cc Christoph & Marek > > On 2024/11/6 4:11, Jesper Dangaard Brouer wrote: > > ... > >>>> >>>>> I am not sure if I understand the reasoning behind the above suggestion to 'wait >>>>> and see if this actually turns out to be a problem' when we already know that there >>>>> are some cases which need cache kicking/flushing for the waiting to work and those >>>>> kicking/flushing may not be easy and may take indefinite time too, not to mention >>>>> there might be other cases that need kicking/flushing that we don't know yet. >>>>> >>>>> Is there any reason not to consider recording the inflight pages so that unmapping >>>>> can be done for inflight pages before driver unbound supposing dynamic number of >>>>> inflight pages can be supported? >>>>> >>>>> IOW, Is there any reason you and jesper taking it as axiomatic that recording the >>>>> inflight pages is bad supposing the inflight pages can be unlimited and recording >>>>> can be done with least performance overhead? >>>> >>>> Well, page pool is a memory allocator, and it already has a mechanism to >>>> handle returning of memory to it. You're proposing to add a second, >>>> orthogonal, mechanism to do this, one that adds both overhead and >>> >>> I would call it as a replacement/improvement for the old one instead of >>> 'a second, orthogonal' as the old one doesn't really exist after this patch. >>> >> >> Yes, are proposing doing a very radical change to the page_pool design. >> And this is getting proposed as a fix patch for IOMMU. >> >> It is a very radical change that page_pool needs to keep track of *ALL* in-flight pages. > > I am agreed that it is a radical change, that is why it is targetting net-next > tree instead of net tree even when there is a Fixes tag for it. > > If there is a proper and non-radical way to fix that, I would prefer the > non-radical way too. > >> >> The DMA issue is a life-time issue of DMA object associated with the >> struct device.  Then, why are you not looking at extending the life-time > > It seems it is not really about the life-time of DMA object associated with the > life-time of 'struct device', it seems to be the life-time of DMA API associated > with the life-time of the driver for the 'struct device' from the the opinion of > experts from IOMMU/DMA subsystem in [1] & [2]. There is no "DMA object". The DMA API expects to be called with a valid device bound to a driver. There are parts in many different places all built around that expectation to varying degrees. Looking again, it seems dma_debug_device_change() has existed for way longer than the page_pool code, so frankly I'm a little surprised that this case is only coming up now in this context... Even if one tries to handwave past that with a bogus argument that technically these DMA mappings belong to the subsystem rather than the driver itself, it is clearly unrealistic to imagine that once a device is torn down by device_del() it's still valid for anything. In fact, before even that point, it is explicitly documented that a device which is merely offlined prior to potential removal "cannot be used for any purpose", per Documentation/ABI/testing/sysfs-devices-online. Holding a refcount so that the memory backing the struct device can still be accessed without a literal use-after-free does not represent the device being conceptually valid in any API-level sense. Even if the device isn't removed, as soon as its driver is unbound its DMA ops can change; the driver could then be re-bound, and the device valid for *new* DMA mappings again, but it's still bogus to attempt to unmap outstanding old mappings through the new ops (which is just as likely to throw an error/crash/corrupt memory/whatever). The page pool DMA mapping design is just fundamentally incorrect with respect to the device/driver model lifecycle. > I am not sure what is reasoning behind the above, but the implementation seems > to be the case as mentioned in [3]: > __device_release_driver -> device_unbind_cleanup -> arch_teardown_dma_ops > > 1. https://lkml.org/lkml/2024/8/6/632 > 2. https://lore.kernel.org/all/20240923175226.GC9634@ziepe.ca/ > 3. https://lkml.org/lkml/2024/10/15/686 > >> of the DMA object, or at least detect when DMA object goes away, such >> that we can change a setting in page_pool to stop calling DMA unmap for >> the pages in-flight once they get returned (which we have en existing >> mechanism for). > > To be honest, I was mostly depending on the opinion of the experts from > IOMMU/DMA subsystem for the correct DMA API usage as mentioned above. > So I am not sure if skipping DMA unmapping for the inflight pages is the > correct DMA API usage? > If it is the correct DMA API usage, how to detect that if DMA unmapping > can be skipped? Once page_pool has allowed a driver to unbind from a device without cleaning up all outstanding DMA mappings made via that device, then it has already leaked those mappings and the damage is done, regardless of whether the effects are visible yet. If you'd really rather play whack-a-mole trying to paper over secondary symptoms of that issue than actually fix it, then fine, but don't expect any driver core/DMA API changes to be acceptable for that purpose. However, if people are hitting those symptoms now, then I'd imagine they're eventually going to come back asking about the ones which can't be papered over, like dma-debug reporting the leaks, or just why their system ends up burning gigabytes on IOMMU pagetables and IOVA kmem_caches. Thanks, Robin.