From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C048CFA3737 for ; Fri, 13 Sep 2024 09:36:01 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 1C25A6B009A; Fri, 13 Sep 2024 05:36:01 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 171826B009E; Fri, 13 Sep 2024 05:36:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 011226B00C2; Fri, 13 Sep 2024 05:36:00 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id D60116B009A for ; Fri, 13 Sep 2024 05:36:00 -0400 (EDT) Received: from smtpin14.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id 652FD80915 for ; Fri, 13 Sep 2024 09:36:00 +0000 (UTC) X-FDA: 82559208480.14.E5D8096 Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) by imf03.hostedemail.com (Postfix) with ESMTP id EA5A320004 for ; Fri, 13 Sep 2024 09:35:56 +0000 (UTC) Authentication-Results: imf03.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf03.hostedemail.com: domain of linyunsheng@huawei.com designates 45.249.212.190 as permitted sender) smtp.mailfrom=linyunsheng@huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1726220129; a=rsa-sha256; cv=none; b=4wWOAV2Edgno3h8S2PfGW5iob/YBRn19yeXjTO2azDWQTlXnj04eMalnyZ7ZendH6CyEub 3OJO5nnXIamWD1FkoFaXWnjgOZIBYR7fCeO5AjTDZ8LBArXOrYN9Yr+uWcG8kXFMFGrSu/ 0uSt9B5adj72sRpQeAQVgoeyNRuI7t4= ARC-Authentication-Results: i=1; imf03.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf03.hostedemail.com: domain of linyunsheng@huawei.com designates 45.249.212.190 as permitted sender) smtp.mailfrom=linyunsheng@huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1726220129; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=g4RqOsNL6IvtloUXZrR1fdZjlsz7yR52eaJwMIl6Ajk=; b=RvxqCYW2vIBpWpXGK485kwj2jSu+L5zbv9hTwH5NB7pM2gd6o1kgqUNkizroJz1dgkkkyG dzHrUoijF4lrmuNgfn5NCiZVePHMjip+1OtxHd6Qi01VHcRKTDEJXsNvCzcK91/i/JkOG3 Bfoo0GE1UWnkOtQQjMa906pLga7XNSk= Received: from mail.maildlp.com (unknown [172.19.162.112]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4X4q0m4b9zz2CpYn; Fri, 13 Sep 2024 17:35:20 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id 79E181402CE; Fri, 13 Sep 2024 17:35:52 +0800 (CST) Received: from [10.67.120.129] (10.67.120.129) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Fri, 13 Sep 2024 17:35:51 +0800 Message-ID: <00c41a5b-6a5b-4ee1-b0e2-eae819e3cf3b@huawei.com> Date: Fri, 13 Sep 2024 17:35:51 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [RFC 2/2] page_pool: fix IOMMU crash when driver has already unbound To: Mina Almasry CC: , , , , , Robin Murphy , Alexander Duyck , IOMMU , Wei Fang , Shenwei Wang , Clark Wang , Eric Dumazet , Tony Nguyen , Przemek Kitszel , Alexander Lobakin , Alexei Starovoitov , Daniel Borkmann , Jesper Dangaard Brouer , John Fastabend , Saeed Mahameed , Leon Romanovsky , Tariq Toukan , Felix Fietkau , Lorenzo Bianconi , Ryder Lee , Shayne Chen , Sean Wang , Kalle Valo , Matthias Brugger , AngeloGioacchino Del Regno , Andrew Morton , Ilias Apalodimas , , , , , , , , , , References: <20240912124514.2329991-1-linyunsheng@huawei.com> <20240912124514.2329991-3-linyunsheng@huawei.com> Content-Language: en-US From: Yunsheng Lin In-Reply-To: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 8bit X-Originating-IP: [10.67.120.129] X-ClientProxiedBy: dggems702-chm.china.huawei.com (10.3.19.179) To dggpemf200006.china.huawei.com (7.185.36.61) X-Rspam-User: X-Rspamd-Queue-Id: EA5A320004 X-Rspamd-Server: rspam01 X-Stat-Signature: 1sygdorkdga115n79oyzczj97uqksfir X-HE-Tag: 1726220156-747496 X-HE-Meta: U2FsdGVkX1+F/spkisADZYooFFnS2ftR44eVO6PQFhNUiMztj/40eipiZw/u4dHeL/Xof2uIUO3CALmP9Mc0Pom1S3f45tJxmPa2GWa9giNK/rgSf53Ui8TWbNwFhbUMLmeNRUK1P1UAW6Y//gzzsBRN7Gqpu7YOPCrAHhFWcvxzMjM+G1Q8wOC2upWkROL+uvmezsW5PrKfthoh3J2GIkIB0ABB2Cl4icektQFPMi2qDcIt+SJhQ8/If67aVUN2xJJi3SQ9TzEMX6jC48qAMGYJiFuVJba7tKS2mKjR3SVagUICu3R6TJmhzcWH/mu7GVVJGPFrOIrJ9t7+77h7RDopqTuue4mhvgE8x1GAR/Rrhvsf3APVf9TIJe5iValBuJpJvaoF2ypAE+0+WVt0a1AJV6Mf3J4vdVlcXtXlWo9yL/2lkOCESIk2hechJ5bvhbz9A7IFFRu6wnIYgeNFr3txLrhKikjRIEYK+1+OMyR1j5fbG1c5ObpdffzcMFuzAfQHBfx0Wy/gwQzlXPD2JLMSzpXNwriTaRVglAcl3eMAJ5vRK+C2xLu0SRIMt5Mt328NEJgXvFdHkZTxQjx7bgEtw7HD55uYOUCsPp5+7UzXHL7dMcAuvRcJmKV5G1FjVaXbVR8o8xlyqJLeNqSY15h7b7+pntNqQych6kuyLMRp8lwY0p53WGgFhL2eEN+BGiTmCWkQqZGlUqWcqFINVU9Ggasw4wOrb1mtfXvv4ikaW8FZvHpFzQt5ge3PEuPGMj5w9KNDE6kce8fPC8GpJWTrl7HACRHIYkYHhQRn9+8E0qdwxqSgIMqNtXbYHrthblUQlvNvdDyglFsc2RGW0nHe4f7QLEPFvlX3yMnAJTcFyZ0y09B+EL0pqa0MLFDHZodfDapwtguuTIjQFxFt8BQ1AeuZC2jruEAGavP/fgzuS1gNAkMOsTTais4OqL2Spl/AFhqG9VfFJEjzvkq YuxY38ZY XZyoCO2p3pJa3j2YQtxNalOommcECknbXZGFRYv1869JSOqpBbdT5czjzqyXH5WnfSgwgoTY3jPBrEMjowcGB/FewSQ90LLtNOUUbto8tLLElLHcMUSce9BX+5pyQ499Z6ru9dqx7cyb3uj1fgRQMiuErVSPA3z489PymrO9H9bVqWbFEirEIs299Rl+Lxsr+798JTQW9gR+MtrHu1/gwKga1mJCC3ZiiXN9q4CeM4SVZNfD2SzFfBidU1ORP3JyHOLWTcNSTM7pWV+BhJdee0O7mwTli/BStPtMZdyya1szJvFnhLXVGyByy983CEY2xFTNwNjeJaXlQxS6P7MshlhKDcg== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2024/9/12 22:25, Mina Almasry wrote: > On Thu, Sep 12, 2024 at 5:51 AM Yunsheng Lin wrote: >> >> Networking driver with page_pool support may hand over page >> still with dma mapping to network stack and try to reuse that >> page after network stack is done with it and passes it back >> to page_pool to avoid the penalty of dma mapping/unmapping. >> With all the caching in the network stack, some pages may be >> held in the network stack without returning to the page_pool >> soon enough, and with VF disable causing the driver unbound, >> the page_pool does not stop the driver from doing it's >> unbounding work, instead page_pool uses workqueue to check >> if there is some pages coming back from the network stack >> periodically, if there is any, it will do the dma unmmapping >> related cleanup work. >> >> As mentioned in [1], attempting DMA unmaps after the driver >> has already unbound may leak resources or at worst corrupt >> memory. Fundamentally, the page pool code cannot allow DMA >> mappings to outlive the driver they belong to. >> >> Currently it seems there are at least two cases that the page >> is not released fast enough causing dma unmmapping done after >> driver has already unbound: >> 1. ipv4 packet defragmentation timeout: this seems to cause >> delay up to 30 secs: >> >> 2. skb_defer_free_flush(): this may cause infinite delay if >> there is no triggering for net_rx_action(). >> >> In order not to do the dma unmmapping after driver has already >> unbound and stall the unloading of the networking driver, add >> the pool->items array to record all the pages including the ones >> which are handed over to network stack, so the page_pool can >> do the dma unmmapping for those pages when page_pool_destroy() >> is called. >> > > The approach in this patch is a bit complicated. I wonder if there is > something simpler that we can do. From reading the thread, it seems > the issue is that in __page_pool_release_page_dma we're calling > dma_unmap_page_attrs() on a pool->p.dev that has been deleted via > device_del, right? > > Why not consider pool->p.dev unusable if pool->destroy_cnt > 0? I.e. > in __page_pool_release_page_dma, we can skip dma_unmap_page_attrs() if > destry_cnt > 0? The skipping is already done for __dma_sync_single_for_device() in this patch, but not for dma_unmap_page_attrs(), see the clearing of dma_sync in page_pool_destroy(). > > More generally, probably any use of pool->p.dev may be invalid if > page_pool_destroy has been called. The call sites can be scrubbed for > latent bugs. As mentioned in commit log, attempting DMA unmaps after the driver has already unbound may leak resources or at worst corrupt memory. The skipping only avoid corrupting memory, but not leaking resources, as there may be bounce buffer or iova resources before the dma mapping as my understanding. And when page_pool_destroy() is called, there is currently no way to tell if the device is going to be invalid or not. > > The hard part is handling the concurrency. I'm not so sure we can fix > this without introducing some synchronization between the > page_pool_destroy seeing the device go away and the code paths using > the device. Are these being called from the fast paths? Jespers > benchmark can tell for sure if there is any impact on the fast path. It depends on what your definition of fast path here, there are three kinds of fast path for page pool as my understanding. For the first and second fast path, the synchronization cost is the rcu read lock overhead introduced in patch 1. For the third fast path when we need to refill from or flush to the page allocator, the added overhead is the page_pool_item_add() and page_pool_item_del() calling as my understanding. Will provide the performance data in the next version. > >> Note, the devmem patchset seems to make the bug harder to fix >> and to backport too, this patch does not consider fixing the >> case for devmem yet. >> > > FWIW from a quick look I did not see anything in this patch that is > extremely hard to port to netmem. AFAICT the issue is that you skipped > changing page_pool to page_pool_items in net_iov. Once that is done, I > think the rest should be straightforward. Does page_pool_item_uninit() need to handle 'netmem' case? How does the devmem handle the case net_iov being cached in network stack when the driver has already unbound? I am supposing there is a device for dma_buf_unmap_attachment_unlocked(), and is it possible that the device become invalid too when the driver has unbound? If yes, doesn't it have a similar problem as the case of normal page? >