Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Yunsheng Lin <linyunsheng@huawei.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, fanghaiqing@huawei.com,
	liuyonglong@huawei.com, Robin Murphy <robin.murphy@arm.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	IOMMU <iommu@lists.linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Eric Dumazet <edumazet@google.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Thu, 24 Oct 2024 16:40:40 +0200	[thread overview]
Message-ID: <878qudftsn.fsf@toke.dk> (raw)
In-Reply-To: <113c9835-f170-46cf-92ba-df4ca5dfab3d@huawei.com>

Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2024/10/23 2:14, Jesper Dangaard Brouer wrote:
>> 
>> 
>> On 22/10/2024 05.22, Yunsheng Lin wrote:
>>> Networking driver with page_pool support may hand over page
>>> still with dma mapping to network stack and try to reuse that
>>> page after network stack is done with it and passes it back
>>> to page_pool to avoid the penalty of dma mapping/unmapping.
>>> With all the caching in the network stack, some pages may be
>>> held in the network stack without returning to the page_pool
>>> soon enough, and with VF disable causing the driver unbound,
>>> the page_pool does not stop the driver from doing it's
>>> unbounding work, instead page_pool uses workqueue to check
>>> if there is some pages coming back from the network stack
>>> periodically, if there is any, it will do the dma unmmapping
>>> related cleanup work.
>>>
>>> As mentioned in [1], attempting DMA unmaps after the driver
>>> has already unbound may leak resources or at worst corrupt
>>> memory. Fundamentally, the page pool code cannot allow DMA
>>> mappings to outlive the driver they belong to.
>>>
>>> Currently it seems there are at least two cases that the page
>>> is not released fast enough causing dma unmmapping done after
>>> driver has already unbound:
>>> 1. ipv4 packet defragmentation timeout: this seems to cause
>>>     delay up to 30 secs.
>>> 2. skb_defer_free_flush(): this may cause infinite delay if
>>>     there is no triggering for net_rx_action().
>>>
>>> In order not to do the dma unmmapping after driver has already
>>> unbound and stall the unloading of the networking driver, add
>>> the pool->items array to record all the pages including the ones
>>> which are handed over to network stack, so the page_pool can
>> 
>> I really really dislike this approach!
>> 
>> Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>> 
>> Having to keep an array to record all the pages including the ones
>> which are handed over to network stack, goes against the very principle
>> behind page_pool. We added members to struct page, such that pages could
>> be "outstanding".
>
> Before and after this patch both support "outstanding", the difference is
> how many "outstanding" pages do they support.
>
> The question seems to be do we really need unlimited inflight page for
> page_pool to work as mentioned in [1]?
>
> 1. https://lore.kernel.org/all/5d9ea7bd-67bb-4a9d-a120-c8f290c31a47@huawei.com/

Well, yes? Imposing an arbitrary limit on the number of in-flight
packets (especially such a low one as in this series) is a complete
non-starter. Servers have hundreds of gigs of memory these days, and if
someone wants to use that for storing in-flight packets, the kernel
definitely shouldn't impose some (hard-coded!) limit on that.

>> 
>> The page_pool already have a system for waiting for these outstanding /
>> inflight packets to get returned.  As I suggested before, the page_pool
>> should simply take over the responsability (from net_device) to free the
>> struct device (after inflight reach zero), where AFAIK the DMA device is
>> connected via.
>
> It seems you mentioned some similar suggestion in previous version,
> it would be good to show some code about the idea in your mind, I am sure
> that Yonglong Liu Cc'ed will be happy to test it if there some code like
> POC/RFC is provided.

I believe Jesper is basically referring to Jakub's RFC that you
mentioned below.

> I should mention that it seems that DMA device is not longer vaild when
> remove() function of the device driver returns, as mentioned in [2], which
> means dma API is not allowed to called after remove() function of the device
> driver returns.
>
> 2. https://www.spinics.net/lists/netdev/msg1030641.html
>
>> 
>> The alternative is what Kuba suggested (and proposed an RFC for),  that
>> the net_device teardown waits for the page_pool inflight packets.
>
> As above, the question is how long does the waiting take here?
> Yonglong tested Kuba's RFC, see [3], the waiting took forever due to
> reason as mentioned in commit log:
> "skb_defer_free_flush(): this may cause infinite delay if there is no
> triggering for net_rx_action()."

Honestly, this just seems like a bug (the "no triggering of
net_rx_action()") that should be root caused and fixed; not a reason
that waiting can't work.

-Toke

next prev parent reply	other threads:[~2024-10-24 14:40 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <20241022032214.3915232-1-linyunsheng@huawei.com>
2024-10-22  3:22 ` Yunsheng Lin
2024-10-22 16:40   ` Simon Horman
2024-10-22 18:14   ` Jesper Dangaard Brouer
2024-10-23  8:59     ` Yunsheng Lin
2024-10-24 14:40       ` Toke Høiland-Jørgensen [this message]
2024-10-25  3:20         ` Yunsheng Lin
2024-10-25 11:16           ` Toke Høiland-Jørgensen
2024-10-25 14:07             ` Jesper Dangaard Brouer
2024-10-26  7:33               ` Yunsheng Lin
2024-11-06 13:25                 ` Jesper Dangaard Brouer
2024-11-06 15:57                   ` Jesper Dangaard Brouer
2024-11-06 19:55                     ` Alexander Duyck
2024-11-07 11:10                       ` Yunsheng Lin
2024-11-07 11:09                     ` Yunsheng Lin
2024-11-11 11:31                 ` Yunsheng Lin
2024-11-11 18:51                   ` Toke Høiland-Jørgensen
2024-11-12 12:22                     ` Yunsheng Lin
2024-11-12 14:19                       ` Jesper Dangaard Brouer
2024-11-13 12:21                         ` Yunsheng Lin
     [not found]                         ` <40c9b515-1284-4c49-bdce-c9eeff5092f9@huawei.com>
2024-11-18 15:11                           ` Jesper Dangaard Brouer
2024-10-26  7:32             ` Yunsheng Lin
2024-10-29 13:58               ` Toke Høiland-Jørgensen
2024-10-30 11:30                 ` Yunsheng Lin
2024-10-30 11:57                   ` Toke Høiland-Jørgensen
2024-10-31 12:17                     ` Yunsheng Lin
2024-10-31 16:18                       ` Toke Høiland-Jørgensen
2024-11-01 11:11                         ` Yunsheng Lin
2024-11-05 20:11                           ` Jesper Dangaard Brouer
2024-11-06 10:56                             ` Yunsheng Lin
2024-11-06 14:17                               ` Robin Murphy
2024-11-07  8:41                               ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=878qudftsn.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=alexander.duyck@gmail.com \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=fanghaiqing@huawei.com \
    --cc=hawk@kernel.org \
    --cc=ilias.apalodimas@linaro.org \
    --cc=iommu@lists.linux.dev \
    --cc=kernel-team@cloudflare.com \
    --cc=kuba@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linyunsheng@huawei.com \
    --cc=liuyonglong@huawei.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=robin.murphy@arm.com \
    --cc=zhangkun09@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox