From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Yunsheng Lin <linyunsheng@huawei.com>,
davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, liuyonglong@huawei.com,
fanghaiqing@huawei.com, Robin Murphy <robin.murphy@arm.com>,
Alexander Duyck <alexander.duyck@gmail.com>,
IOMMU <iommu@lists.linux.dev>,
Andrew Morton <akpm@linux-foundation.org>,
Eric Dumazet <edumazet@google.com>,
Simon Horman <horms@kernel.org>,
Ilias Apalodimas <ilias.apalodimas@linaro.org>,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
netdev@vger.kernel.org, kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH net-next v7 3/8] page_pool: fix IOMMU crash when driver has already unbound
Date: Thu, 16 Jan 2025 17:09:50 +0100 [thread overview]
Message-ID: <95f258b2-52f5-4a80-a670-b9a182caec7c@kernel.org> (raw)
In-Reply-To: <2b5a58f3-d67a-4bf7-921a-033326958ac6@huawei.com>
On 16/01/2025 13.52, Yunsheng Lin wrote:
> On 2025/1/16 0:29, Jesper Dangaard Brouer wrote:
>>
>>
>> On 10/01/2025 14.06, Yunsheng Lin wrote:
>> [...]
>>> In order not to call DMA APIs to do DMA unmapping after the driver
>>> has already unbound, and to avoid stalling the unloading of the
>>> networking driver, use some pre-allocated item blocks to record
>>> inflight pages, including the ones handed over to the network stack,
>>> so that the page_pool can do the DMA unmapping for those pages when
>>> page_pool_destroy() is called. As the pre-allocated item blocks
>>> need to be large enough to avoid performance degradation, add an
>>> 'item_fast_empty' stat to indicate the unavailability of the
>>> pre-allocated item blocks.
>>>
>>
>
> ...
>
>>> +
>>> +static __always_inline void __page_pool_release_page_dma(struct page_pool *pool,
>>> + netmem_ref netmem,
>>> + bool destroyed)
>>> +{
>>> + struct page_pool_item *item;
>>> + dma_addr_t dma;
>>> +
>>> + if (!pool->dma_map)
>>> + /* Always account for inflight pages, even if we didn't
>>> + * map them
>>> + */
>>> + return;
>>> +
>>> + dma = page_pool_get_dma_addr_netmem(netmem);
>>> + item = netmem_get_pp_item(netmem);
>>> +
>>> + /* dma unmapping is always needed when page_pool_destroy() is not called
>>> + * yet.
>>> + */
>>> + DEBUG_NET_WARN_ON_ONCE(!destroyed && !page_pool_item_is_mapped(item));
>>> + if (unlikely(destroyed && !page_pool_item_is_mapped(item)))
>>> + return;
>>> +
>>> + /* When page is unmapped, it cannot be returned to our pool */
>>> + dma_unmap_page_attrs(pool->p.dev, dma,
>>> + PAGE_SIZE << pool->p.order, pool->p.dma_dir,
>>> + DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING);
>>> + page_pool_set_dma_addr_netmem(netmem, 0);
>>> + page_pool_item_clear_mapped(item);
>>> +}
>>> +
>>
>> I have a hard time reading/reviewing/maintaining the code below without
>> some design description. This code needs more comments on the *intent*
>> and the design it is trying to achieve.
>>
>> From patch description the only hint I have is:
>> "use some pre-allocated item blocks to record inflight pages"
>>
>> E.g. Why is it needed/smart to hijack the page->pp pointer?
>
> Mainly because there is no space available for keeping track of inflight
> pages; using page->pp can only find the page_pool owning the page, but the
> page_pool is not able to keep track of an inflight page while the page is
> held by the networking stack.
>
> By using page_pool_item as below, the state is used to tell whether a
> specific item is in use/dma mapped by scanning all the item blocks in
> pool->item_blocks. If a specific item is used by a page, then 'pp_netmem'
> will point to that page so that dma unmapping can be done for that page
> when page_pool_destroy() is called; otherwise the free item sits in
> pool->hold_items or pool->release_items via 'lentry':
>
> struct page_pool_item {
> unsigned long state;
>
> union {
> netmem_ref pp_netmem;
> struct llist_node lentry;
> };
> };
pahole -C page_pool_item vmlinux
struct page_pool_item {
/* An 'encoded_next' is a pointer to next item, lower 2 bits is used to
* indicate the state of current item.
*/
long unsigned int encoded_next; /* 0 8 */
union {
netmem_ref pp_netmem; /* 8 8 */
struct llist_node lentry; /* 8 8 */
}; /* 8 8 */
/* size: 16, cachelines: 1, members: 2 */
/* last cacheline: 16 bytes */
};
> When a page is added to the page_pool, an item is removed from
> pool->hold_items or pool->release_items, its 'pp_netmem' is set to point
> to that page, and 'state' is set accordingly in order to keep track of
> that page.
>
> When a page is deleted from the page_pool, the below function tells which
> page_pool the page belongs to, and after clearing the 'state', the item is
> added back to pool->release_items so that the item can be reused for new
> pages.
>
To understand below, I'm listing struct page_pool_item_block for other
reviewers:
pahole -C page_pool_item_block vmlinux
struct page_pool_item_block {
struct page_pool * pp; /* 0 8 */
struct list_head list; /* 8 16 */
unsigned int flags; /* 24 4 */
refcount_t ref; /* 28 4 */
struct page_pool_item items[]; /* 32 0 */
/* size: 32, cachelines: 1, members: 5 */
/* last cacheline: 32 bytes */
};
> static inline struct page_pool_item_block *
> page_pool_item_to_block(struct page_pool_item *item)
> {
> return (struct page_pool_item_block *)((unsigned long)item & PAGE_MASK);
This trick requires some comments explaining what is going on!
Please correct me if I'm wrong: here you are masking off the lower bits of
the pointer to the page_pool_item *item, as you know that a struct
page_pool_item_block is stored at the top of a struct page. This trick
is like a "container_of" for going from page_pool_item to
page_pool_item_block, right?
I do notice that you have a comment above struct page_pool_item_block
(that says "item_block is always PAGE_SIZE"), which is nice, but to be
more explicit/clear:
I want a big comment block (placed above the main code here) that
explains the design and intention behind this newly invented
"item-block" scheme, like e.g. the connection between
page_pool_item_block and page_pool_item, like the advantage/trick that
allows the page->pp pointer to be an "item" and be mapped back to a "block"
to find the page_pool object it belongs to. Don't write *what* the code
does, but write about the intended purpose and design reasons behind the
code.
> }
>
> static inline struct page_pool *page_pool_get_pp(struct page *page)
> {
> return page_pool_item_to_block(page->pp_item)->pp;
> }
>
>
>>
>>> +static void __page_pool_item_init(struct page_pool *pool, struct page *page)
>>> +{
>>
>> Function name is confusing. First I thought this was init'ing a single
>> item, but looking at the code it is iterating over ITEMS_PER_PAGE.
>>
>> Maybe it should be called page_pool_item_block_init ?
>
> The __page_pool_item_init() is added to make the below
> page_pool_item_init() function more readable and maintainable; changing
> it to page_pool_item_block_init doesn't seem consistent?
You (of course) also have to rename the other function; I thought that was
implicitly understood.
BUT does my suggested rename make sense? What I'm seeing is that all
the *items* in the "block" are getting initialized. But we are also setting
up the "block" (e.g. "block->pp = pool").
>>
>>> + struct page_pool_item_block *block = page_address(page);
>>> + struct page_pool_item *items = block->items;
>>> + unsigned int i;
>>> +
>>> + list_add(&block->list, &pool->item_blocks);
>>> + block->pp = pool;
>>> +
>>> + for (i = 0; i < ITEMS_PER_PAGE; i++) {
>>> + page_pool_item_init_state(&items[i]);
>>> + __llist_add(&items[i].lentry, &pool->hold_items);
>>> + }
>>> +}
>>> +
>>> +static int page_pool_item_init(struct page_pool *pool)
>>> +{
>>> +#define PAGE_POOL_MIN_INFLIGHT_ITEMS 512
>>> + struct page_pool_item_block *block;
>>> + int item_cnt;
>>> +
>>> + INIT_LIST_HEAD(&pool->item_blocks);
>>> + init_llist_head(&pool->hold_items);
>>> + init_llist_head(&pool->release_items);
>>> +
>>> + item_cnt = pool->p.pool_size * 2 + PP_ALLOC_CACHE_SIZE +
>>> + PAGE_POOL_MIN_INFLIGHT_ITEMS;
>>> + while (item_cnt > 0) {
>>> + struct page *page;
>>> +
>>> + page = alloc_pages_node(pool->p.nid, GFP_KERNEL, 0);
>>> + if (!page)
>>> + goto err;
>>> +
>>> + __page_pool_item_init(pool, page);
>>> + item_cnt -= ITEMS_PER_PAGE;
>>> + }
>>> +
>>> + return 0;
>>> +err:
>>> + list_for_each_entry(block, &pool->item_blocks, list)
>>> + put_page(virt_to_page(block));
>
> This one also has the same use-after-free problem as the
> page_pool_item_uninit in the previous version.
>
>>> +
>>> + return -ENOMEM;
>>> +}
>>> +
>