Message-ID: <95f258b2-52f5-4a80-a670-b9a182caec7c@kernel.org>
Date: Thu, 16 Jan 2025 17:09:50 +0100
Subject: Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver
 has already unbound
To: Yunsheng Lin, davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, liuyonglong@huawei.com, fanghaiqing@huawei.com,
 Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet,
 Simon Horman, Ilias Apalodimas, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kernel-team
References: <20250110130703.3814407-1-linyunsheng@huawei.com>
 <20250110130703.3814407-4-linyunsheng@huawei.com>
 <921c827c-41b7-40af-8c01-c21adbe8f41f@kernel.org>
 <2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com>
From: Jesper Dangaard Brouer
In-Reply-To: <2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com>

On 16/01/2025 13.52, Yunsheng Lin wrote:
> On 2025/1/16 0:29, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>> [...]
>>> In order not to call DMA APIs to do DMA unmapping after the driver
>>> has already unbound, and not to stall the unloading of the networking
>>> driver, use some pre-allocated item blocks to record inflight pages,
>>> including the ones which are handed over to the network stack, so the
>>> page_pool can do the DMA unmapping for those pages when
>>> page_pool_destroy() is called.
>>> As the pre-allocated item blocks need to be large enough to avoid
>>> performance degradation, add an 'item_fast_empty' stat to indicate
>>> the unavailability of the pre-allocated item blocks.
>>>
>>
>
> ...
>
>>> +
>>> +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>>> +                                                         netmem_ref netmem,
>>> +                                                         bool destroyed)
>>> +{
>>> +    struct page_pool_item *item;
>>> +    dma_addr_t dma;
>>> +
>>> +    if (!pool->dma_map)
>>> +        /* Always account for inflight pages, even if we didn't
>>> +         * map them
>>> +         */
>>> +        return;
>>> +
>>> +    dma = page_pool_get_dma_addr_netmem(netmem);
>>> +    item = netmem_get_pp_item(netmem);
>>> +
>>> +    /* dma unmapping is always needed when page_pool_destroy() is not
>>> +     * called yet.
>>> +     */
>>> +    DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
>>> +    if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
>>> +        return;
>>> +
>>> +    /* When page is unmapped, it cannot be returned to our pool */
>>> +    dma_unmap_page_attrs(pool->p.dev, dma,
>>> +                         PAGE_SIZE << pool->p.order, pool->p.dma_dir,
>>> +                         DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
>>> +    page_pool_set_dma_addr_netmem(netmem, 0);
>>> +    page_pool_item_clear_mapped(item);
>>> +}
>>> +
>>
>> I have a hard time reading/reviewing/maintaining the code below without
>> some design description.  This code needs more comments on what the
>> *intent* is and the design it is trying to achieve.
>>
>> From the patch description the only hint I have is:
>>  "use some pre-allocated item blocks to record inflight pages"
>>
>> E.g. why is it needed/smart to hijack the page->pp pointer?
>
> Mainly because there is no space available for keeping track of inflight
> pages: using page->pp can only find the page_pool owning the page, but
> the page_pool is not able to keep track of an inflight page while the
> page is handled by the networking stack.
>
> By using page_pool_item as below, the 'state' is used to tell whether a
> specific item is being used/dma mapped or not by scanning all the item
> blocks in pool->item_blocks. If a specific item is used by a page, then
> 'pp_netmem' will point to that page so that dma unmapping can be done
> for that page when page_pool_destroy() is called; otherwise free items
> sit in pool->hold_items or pool->release_items via 'lentry':
>
> struct page_pool_item {
>         unsigned long state;
>
>         union {
>                 netmem_ref pp_netmem;
>                 struct llist_node lentry;
>         };
> };

pahole -C page_pool_item vmlinux
struct page_pool_item {
        /* An 'encoded_next' is a pointer to the next item; the lower 2 bits
         * are used to indicate the state of the current item.
         */
        long unsigned int          encoded_next;         /*     0     8 */
        union {
                netmem_ref         pp_netmem;            /*     8     8 */
                struct llist_node  lentry;               /*     8     8 */
        };                                               /*     8     8 */

        /* size: 16, cachelines: 1, members: 2 */
        /* last cacheline: 16 bytes */
};

> When a page is added to the page_pool, an item is taken from
> pool->hold_items or pool->release_items, its 'pp_netmem' is set to point
> to that page, and its 'state' is set accordingly in order to keep track
> of that page.
>
> When a page is deleted from the page_pool, it is possible to tell which
> page_pool the page belongs to by using the function below, and after
> clearing the 'state', the item is added back to pool->release_items so
> that the item can be reused for new pages.
>
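To make the lifecycle described above easier to follow, here is a rough
sketch of the two transitions. The helper names are invented purely for
illustration and are not the patch code; only the fields (pp_netmem,
lentry, state) and the lists (hold_items, release_items) come from the
discussion above.

/* Sketch only: helper names are hypothetical, not from the patch. */

/* Page enters the pool: bind a free item to it. */
static void pp_item_attach(struct page_pool *pool, struct page_pool_item *item,
                           netmem_ref netmem)
{
        item->pp_netmem = netmem;          /* item now tracks this page     */
        pp_item_set_state_inuse(item);     /* 'state': in use / DMA-mapped  */
        netmem_set_pp_item(netmem, item);  /* page can be walked back later */
}

/* Page leaves the pool: clear the state and recycle the item. */
static void pp_item_detach(struct page_pool *pool, struct page_pool_item *item)
{
        pp_item_clear_state(item);         /* no longer tracking a page     */
        llist_add(&item->lentry, &pool->release_items);
}
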

To understand below, I'm listing struct page_pool_item_block for other
reviewers:

pahole -C page_pool_item_block vmlinux
struct page_pool_item_block {
        struct page_pool *         pp;                   /*     0     8 */
        struct list_head           list;                 /*     8    16 */
        unsigned int               flags;                /*    24     4 */
        refcount_t                 ref;                  /*    28     4 */
        struct page_pool_item      items[];              /*    32     0 */

        /* size: 32, cachelines: 1, members: 5 */
        /* last cacheline: 32 bytes */
};

> static inline struct page_pool_item_block *
> page_pool_item_to_block(struct page_pool_item *item)
> {
>         return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);

This trick requires some comments explaining what is going on!

Please correct me if I'm wrong: here you are masking off the lower bits
of the pointer to page_pool_item *item, because you know that a struct
page_pool_item_block is stored at the start of the (page-aligned) page
holding the items. This trick is like a "container_of" for going from a
page_pool_item to its page_pool_item_block, right?

I do notice that you have a comment above struct page_pool_item_block
(that says "item_block is always PAGE_SIZE"), which is nice, but to be
more explicit/clear: I want a big comment block (placed above the main
code here) that explains the design and intention behind this newly
invented "item-block" scheme, e.g. the connection between
page_pool_item_block and page_pool_item, and the advantage/trick that
allows the page->pp pointer to be an "item" that can be mapped back to a
"block" to find the page_pool object it belongs to.

Don't write *what* the code does, but write about the intended purpose
and design reasons behind the code.

> }
>
> static inline struct page_pool *page_pool_get_pp(struct page *page)
> {
>         return page_pool_item_to_block(page->pp_item)->pp;
> }
>
>
>>
>>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>>> +{
>>
>> Function name is confusing.  First I thought this was init'ing a single
>> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
>>
>> Maybe it should be called page_pool_item_block_init ?
>
> The __page_pool_item_init() is added to make the below
> page_pool_item_init() function more readable and maintainable; changing
> it to page_pool_item_block_init doesn't seem consistent?

You (of course) also have to rename the other function, I thought that
was implicitly understood. BUT does my suggested rename make sense?
What I'm seeing is that all the *items* in the "block" are getting
inited. But we are also setting up the "block" (e.g. "block->pp = pool").
>>
>>> +    struct page_pool_item_block *block = page_address(page);
>>> +    struct page_pool_item *items = block->items;
>>> +    unsigned int i;
>>> +
>>> +    list_add(&block->list, &pool->item_blocks);
>>> +    block->pp = pool;
>>> +
>>> +    for (i = 0; i < ITEMS_PER_PAGE; i++) {
>>> +        page_pool_item_init_state(&items[i]);
>>> +        __llist_add(&items[i].lentry, &pool->hold_items);
>>> +    }
>>> +}
>>> +
>>> +static int page_pool_item_init(struct page_pool *pool)
>>> +{
>>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS        512
>>> +    struct page_pool_item_block *block;
>>> +    int item_cnt;
>>> +
>>> +    INIT_LIST_HEAD(&pool->item_blocks);
>>> +    init_llist_head(&pool->hold_items);
>>> +    init_llist_head(&pool->release_items);
>>> +
>>> +    item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
>>> +        PAGE_POOL_MIN_INFLIGHT_ITEMS;
>>> +    while (item_cnt > 0) {
>>> +        struct page *page;
>>> +
>>> +        page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
>>> +        if (!page)
>>> +            goto err;
>>> +
>>> +        __page_pool_item_init(pool, page);
>>> +        item_cnt -= ITEMS_PER_PAGE;
>>> +    }
>>> +
>>> +    return 0;
>>> +err:
>>> +    list_for_each_entry(block, &pool->item_blocks, list)
>>> +        put_page(virt_to_page(block));
>
> This one also has a use-after-free problem, like page_pool_item_uninit
> in the previous version.
>>> +
>>> +    return -ENOMEM;
>>> +}
>>> +
>
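For reference, the use-after-free mentioned above comes from put_page()
freeing the page that embeds 'block' (including its list linkage) while
list_for_each_entry() still needs that linkage to advance. A rough sketch
of the usual fix is the _safe iterator; this is only illustrative (the
extra 'next' local is added here for the sketch) and not necessarily how
the next revision fixes it:

err:
        /* The _safe variant caches the next entry before put_page() frees
         * the page that embeds 'block', so the loop never dereferences the
         * just-freed list node when advancing.
         */
        list_for_each_entry_safe(block, next, &pool->item_blocks, list)
                put_page(virt_to_page(block));

        return -ENOMEM;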