Message-ID: <95f258b2-52f5-4a80-a670-b9a182caec7c@kernel.org>
Date: Thu, 16 Jan 2025 17:09:50 +0100
Subject: Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver
 has already unbound
To: Yunsheng Lin, davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, liuyonglong@huawei.com, fanghaiqing@huawei.com,
 Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet,
 Simon Horman, Ilias Apalodimas, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kernel-team
References: <20250110130703.3814407-1-linyunsheng@huawei.com>
 <20250110130703.3814407-4-linyunsheng@huawei.com>
 <921c827c-41b7-40af-8c01-c21adbe8f41f@kernel.org>
 <2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com>
From: Jesper Dangaard Brouer
In-Reply-To: <2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com>

On 16/01/2025 13.52, Yunsheng Lin wrote:
> On 2025/1/16 0:29, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>> [...]
>>> In order not to call DMA APIs to do DMA unmapping after the driver
>>> has already unbound, and not to stall the unloading of the networking
>>> driver, use some pre-allocated item blocks to record inflight pages,
>>> including the ones which are handed over to the network stack, so the
>>> page_pool can do the DMA unmapping for those pages when
>>> page_pool_destroy() is called.
>>> As the pre-allocated item blocks need to be large enough to avoid
>>> performance degradation, add an 'item_fast_empty' stat to indicate
>>> the unavailability of the pre-allocated item blocks.
>>>
>>
>
> ...
>
>>> +
>>> +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>>> +                                                         netmem_ref netmem,
>>> +                                                         bool destroyed)
>>> +{
>>> +    struct page_pool_item *item;
>>> +    dma_addr_t dma;
>>> +
>>> +    if (!pool->dma_map)
>>> +        /* Always account for inflight pages, even if we didn't
>>> +         * map them
>>> +         */
>>> +        return;
>>> +
>>> +    dma = page_pool_get_dma_addr_netmem(netmem);
>>> +    item = netmem_get_pp_item(netmem);
>>> +
>>> +    /* dma unmapping is always needed when page_pool_destroy() is not
>>> +     * called yet.
>>> +     */
>>> +    DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
>>> +    if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
>>> +        return;
>>> +
>>> +    /* When page is unmapped, it cannot be returned to our pool */
>>> +    dma_unmap_page_attrs(pool->p.dev, dma,
>>> +                         PAGE_SIZE << pool->p.order, pool->p.dma_dir,
>>> +                         DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
>>> +    page_pool_set_dma_addr_netmem(netmem, 0);
>>> +    page_pool_item_clear_mapped(item);
>>> +}
>>> +
>>
>> I have a hard time reading/reviewing/maintaining the code below without
>> some design description.  This code needs more comments on what the
>> *intent* is and the design it is trying to achieve.
>>
>> From the patch description the only hint I have is:
>>  "use some pre-allocated item blocks to record inflight pages"
>>
>> E.g. why is it needed/smart to hijack the page->pp pointer?
>
> Mainly because there is no space available for keeping track of inflight
> pages: using page->pp can only find the page_pool owning the page, but
> the page_pool is not able to keep track of an inflight page while the
> page is handled by the networking stack.
>
> By using page_pool_item as below, the 'state' is used to tell whether a
> specific item is being used/dma mapped or not by scanning all the item
> blocks in pool->item_blocks. If a specific item is used by a page, then
> 'pp_netmem' will point to that page so that dma unmapping can be done
> for that page when page_pool_destroy() is called; otherwise free items
> sit in pool->hold_items or pool->release_items via 'lentry':
>
> struct page_pool_item {
>         unsigned long state;
>
>         union {
>                 netmem_ref pp_netmem;
>                 struct llist_node lentry;
>         };
> };

pahole -C page_pool_item vmlinux
struct page_pool_item {
        /* An 'encoded_next' is a pointer to the next item; the lower 2 bits
         * are used to indicate the state of the current item.
         */
        long unsigned int          encoded_next;         /*     0     8 */
        union {
                netmem_ref         pp_netmem;            /*     8     8 */
                struct llist_node  lentry;               /*     8     8 */
        };                                               /*     8     8 */

        /* size: 16, cachelines: 1, members: 2 */
        /* last cacheline: 16 bytes */
};

> When a page is added to the page_pool, an item is taken from
> pool->hold_items or pool->release_items, its 'pp_netmem' is set to point
> to that page, and its 'state' is set accordingly in order to keep track
> of that page.
>
> When a page is deleted from the page_pool, it is possible to tell which
> page_pool the page belongs to by using the function below, and after
> clearing the 'state', the item is added back to pool->release_items so
> that the item can be reused for new pages.
>
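To make the lifecycle described above easier to follow, here is a rough
sketch of the two transitions. The helper names are invented purely for
illustration and are not the patch code; only the fields (pp_netmem,
lentry, state) and the lists (hold_items, release_items) come from the
discussion above.

/* Sketch only: helper names are hypothetical, not from the patch. */

/* Page enters the pool: bind a free item to it. */
static void pp_item_attach(struct page_pool *pool, struct page_pool_item *item,
                           netmem_ref netmem)
{
        item->pp_netmem = netmem;          /* item now tracks this page     */
        pp_item_set_state_inuse(item);     /* 'state': in use / DMA-mapped  */
        netmem_set_pp_item(netmem, item);  /* page can be walked back later */
}

/* Page leaves the pool: clear the state and recycle the item. */
static void pp_item_detach(struct page_pool *pool, struct page_pool_item *item)
{
        pp_item_clear_state(item);         /* no longer tracking a page     */
        llist_add(&item->lentry, &pool->release_items);
}
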

To understand below, I'm listing struct page_pool_item_block for other
reviewers:

pahole -C page_pool_item_block vmlinux
struct page_pool_item_block {
        struct page_pool *         pp;                   /*     0     8 */
        struct list_head           list;                 /*     8    16 */
        unsigned int               flags;                /*    24     4 */
        refcount_t                 ref;                  /*    28     4 */
        struct page_pool_item      items[];              /*    32     0 */

        /* size: 32, cachelines: 1, members: 5 */
        /* last cacheline: 32 bytes */
};

> static inline struct page_pool_item_block *
> page_pool_item_to_block(struct page_pool_item *item)
> {
>         return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);

This trick requires some comments explaining what is going on!

Please correct me if I'm wrong: here you are masking off the lower bits
of the pointer to page_pool_item *item, because you know that a struct
page_pool_item_block is stored at the start of the (page-aligned) page
holding the items. This trick is like a "container_of" for going from a
page_pool_item to its page_pool_item_block, right?

I do notice that you have a comment above struct page_pool_item_block
(that says "item_block is always PAGE_SIZE"), which is nice, but to be
more explicit/clear: I want a big comment block (placed above the main
code here) that explains the design and intention behind this newly
invented "item-block" scheme, e.g. the connection between
page_pool_item_block and page_pool_item, and the advantage/trick that
allows the page->pp pointer to be an "item" that can be mapped back to a
"block" to find the page_pool object it belongs to.

Don't write *what* the code does, but write about the intended purpose
and design reasons behind the code.

> }
>
> static inline struct page_pool *page_pool_get_pp(struct page *page)
> {
>         return page_pool_item_to_block(page->pp_item)->pp;
> }
>
>
>>
>>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>>> +{
>>
>> Function name is confusing.  First I thought this was init'ing a single
>> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
>>
>> Maybe it should be called page_pool_item_block_init ?
>
> The __page_pool_item_init() is added to make the below
> page_pool_item_init() function more readable and maintainable; changing
> it to page_pool_item_block_init doesn't seem consistent?

You (of course) also have to rename the other function, I thought that
was implicitly understood. BUT does my suggested rename make sense?
What I'm seeing is that all the *items* in the "block" are getting
inited. But we are also setting up the "block" (e.g. "block->pp = pool").
>>
>>> +    struct page_pool_item_block *block = page_address(page);
>>> +    struct page_pool_item *items = block->items;
>>> +    unsigned int i;
>>> +
>>> +    list_add(&block->list, &pool->item_blocks);
>>> +    block->pp = pool;
>>> +
>>> +    for (i = 0; i < ITEMS_PER_PAGE; i++) {
>>> +        page_pool_item_init_state(&items[i]);
>>> +        __llist_add(&items[i].lentry, &pool->hold_items);
>>> +    }
>>> +}
>>> +
>>> +static int page_pool_item_init(struct page_pool *pool)
>>> +{
>>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS        512
>>> +    struct page_pool_item_block *block;
>>> +    int item_cnt;
>>> +
>>> +    INIT_LIST_HEAD(&pool->item_blocks);
>>> +    init_llist_head(&pool->hold_items);
>>> +    init_llist_head(&pool->release_items);
>>> +
>>> +    item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
>>> +        PAGE_POOL_MIN_INFLIGHT_ITEMS;
>>> +    while (item_cnt > 0) {
>>> +        struct page *page;
>>> +
>>> +        page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
>>> +        if (!page)
>>> +            goto err;
>>> +
>>> +        __page_pool_item_init(pool, page);
>>> +        item_cnt -= ITEMS_PER_PAGE;
>>> +    }
>>> +
>>> +    return 0;
>>> +err:
>>> +    list_for_each_entry(block, &pool->item_blocks, list)
>>> +        put_page(virt_to_page(block));
>
> This one also has a use-after-free problem, like page_pool_item_uninit
> in the previous version.
>>> +
>>> +    return -ENOMEM;
>>> +}
>>> +
>
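For reference, the use-after-free mentioned above comes from put_page()
freeing the page that embeds 'block' (including its list linkage) while
list_for_each_entry() still needs that linkage to advance. A rough sketch
of the usual fix is the _safe iterator; this is only illustrative (the
extra 'next' local is added here for the sketch) and not necessarily how
the next revision fixes it:

err:
        /* The _safe variant caches the next entry before put_page() frees
         * the page that embeds 'block', so the loop never dereferences the
         * just-freed list node when advancing.
         */
        list_for_each_entry_safe(block, next, &pool->item_blocks, list)
                put_page(virt_to_page(block));

        return -ENOMEM;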