From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Yunsheng Lin <linyunsheng@huawei.com>,
	Jesper Dangaard Brouer <hawk@kernel.org>,
	davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, fanghaiqing@huawei.com,
	liuyonglong@huawei.com, Robin Murphy <robin.murphy@arm.com>,
	Alexander Duyck <alexander.duyck@gmail.com>,
	IOMMU <iommu@lists.linux.dev>,
	Andrew Morton <akpm@linux-foundation.org>,
	Eric Dumazet <edumazet@google.com>,
	Ilias Apalodimas <ilias.apalodimas@linaro.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, kernel-team <kernel-team@cloudflare.com>
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Thu, 31 Oct 2024 17:18:07 +0100	[thread overview]
Message-ID: <874j4sb60w.fsf@toke.dk> (raw)
In-Reply-To: <023fdee7-dbd4-4e78-b911-a7136ff81343@huawei.com>

Yunsheng Lin <linyunsheng@huawei.com> writes:

> On 2024/10/30 19:57, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin <linyunsheng@huawei.com> writes:
>> 
>>>> But, well, I'm not sure it is? You seem to be taking it as axiomatic
>>>> that the wait in itself is bad. Why? It's just a bit of memory being held
>>>> on to while it is still in use, and so what?
>>>
>>> Actually, I thought about adding some sort of timeout or kicking based on
>>> jakub's waiting patch too.
>>>
>>> But after looking at the amount of caching in the networking stack, waiting plus
>>> kicking/flushing seems harder than recording the inflight pages, mainly because
>>> kicking/flushing needs every subsystem holding page_pool-owned pages to provide
>>> a kicking/flushing mechanism for it to work, not to mention how much time it
>>> would take to do all the kicking/flushing.
>> 
>> Eliding the details above, but yeah, you're right, there are probably
>> some pernicious details to get right if we want to flush all caches. So
>> I wouldn't do that to start with. Instead, just add the waiting, then
>> wait and see if this actually turns out to be a problem in
>> practice. And if it is, identify the source of that problem, deal with
>> it, rinse and repeat :)
>
> I am not sure if I have mentioned to you that Jakub had an RFC for the waiting,
> see [1]. And Yonglong (Cc'ed) tested it: the waiting caused driver unload to
> stall forever and some tasks to hang, see [2].
>
> The root cause for the above case is skb_defer_free_flush() not being called
> as mentioned before.

Well, let's fix that, then! We already have logic to flush backlogs when a
netdevice is going away, so AFAICT all that's needed is to add
skb_defer_free_flush() to that logic. Totally untested patch below, which
we should maybe consider applying in any case.

> I am not sure I understand the reasoning behind the above suggestion to 'wait
> and see if this actually turns out to be a problem' when we already know that there
> are some cases which need cache kicking/flushing for the waiting to work, that the
> kicking/flushing may not be easy and may take an indefinite amount of time, and that
> there might be other cases needing kicking/flushing that we don't know about yet.
>
> Is there any reason not to consider recording the inflight pages so that unmapping
> can be done for them before the driver is unbound, supposing a dynamic number of
> inflight pages can be supported?
>
> IOW, is there any reason you and Jesper are taking it as axiomatic that recording the
> inflight pages is bad, supposing the number of inflight pages can be unlimited and the
> recording can be done with minimal performance overhead?

Well, page_pool is a memory allocator, and it already has a mechanism for
returning memory to it. You're proposing to add a second, orthogonal
mechanism to do this, one that adds both overhead and complexity, yet
doesn't handle all cases (cf. your comment about devmem).
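
To spell out what I mean by "already has a mechanism": the normal
lifecycle a driver goes through looks roughly like the below. This is a
simplified sketch from memory, not copied from any particular driver;
'pdev' is just a stand-in for whatever device the driver has, and error
handling is omitted:

/* Simplified sketch of the existing page_pool lifecycle -- not exact,
 * just to show where pages normally come back to the pool. 'pdev' is a
 * stand-in device, error handling omitted. */
struct page_pool_params pp_params = {
	.flags     = PP_FLAG_DMA_MAP,	/* pool owns the DMA mapping */
	.pool_size = 256,
	.nid       = NUMA_NO_NODE,
	.dev       = &pdev->dev,
	.dma_dir   = DMA_FROM_DEVICE,
};
struct page_pool *pool = page_pool_create(&pp_params);

/* RX path: pull a page from the pool and hand it to the HW/stack */
struct page *page = page_pool_dev_alloc_pages(pool);

/* When the skb is eventually freed, the page comes back through the
 * pool's own return path and is recycled or unmapped there: */
page_pool_put_full_page(pool, page, false);

/* Teardown: destroy defers the final release until the inflight pages
 * have come back, warning periodically while they haven't: */
page_pool_destroy(pool);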

And even if it did handle all cases, force-releasing pages in this way
really feels like it's just papering over the issue. If there are pages
being leaked (or that are outstanding forever, which basically amounts
to the same thing), that is something we should be fixing the root cause
of, not just working around it like this series does.
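
And just so we are talking about the same thing: the shape I understand
the 'record inflight pages' idea to have is roughly the below. This is my
own schematic reading, with made-up names (pp_inflight_track(),
pp_inflight_unmap_all(), and the inflight_list/inflight_lock fields), not
the code from your series:

/* Hypothetical sketch only: every page the pool maps is also kept on a
 * list, so that teardown can unmap whatever is still inflight before
 * the device (and its IOMMU mappings) goes away. */
struct pp_inflight_entry {
	struct list_head	list;
	struct page		*page;
};

/* Hypothetical: called whenever the pool DMA-maps a page it hands out */
static void pp_inflight_track(struct page_pool *pool, struct page *page)
{
	struct pp_inflight_entry *e = kzalloc(sizeof(*e), GFP_ATOMIC);

	if (!e)
		return;	/* sketch only; the real thing can't just give up */

	e->page = page;
	spin_lock(&pool->inflight_lock);	/* made-up fields */
	list_add(&e->list, &pool->inflight_list);
	spin_unlock(&pool->inflight_lock);
}

/* Hypothetical: called at pool teardown, before the driver is unbound,
 * so no page ends up being DMA-unmapped against a dead device later */
static void pp_inflight_unmap_all(struct page_pool *pool)
{
	struct pp_inflight_entry *e, *tmp;

	spin_lock(&pool->inflight_lock);
	list_for_each_entry_safe(e, tmp, &pool->inflight_list, list) {
		dma_unmap_page_attrs(pool->p.dev,
				     page_pool_get_dma_addr(e->page),
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
		list_del(&e->list);
		kfree(e);
	}
	spin_unlock(&pool->inflight_lock);
}

That is the part that worries me in terms of overhead and complexity:
every map now also involves some bookkeeping under a lock, and that
bookkeeping has to cover every path a page can take out of the pool.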

-Toke


Patch to flush the deferred free list when taking down a netdevice;
compile-tested only:



diff --git a/net/core/dev.c b/net/core/dev.c
index ea5fbcd133ae..6e64e24ad6fa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5955,6 +5955,27 @@ EXPORT_SYMBOL(netif_receive_skb_list);
 
 static DEFINE_PER_CPU(struct work_struct, flush_works);
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock(&sd->defer_lock);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock(&sd->defer_lock);
+
+	while (skb != NULL) {
+		next = skb->next;
+		napi_consume_skb(skb, 1);
+		skb = next;
+	}
+}
+
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
@@ -5964,6 +5985,8 @@ static void flush_backlog(struct work_struct *work)
 	local_bh_disable();
 	sd = this_cpu_ptr(&softnet_data);
 
+	skb_defer_free_flush(sd);
+
 	backlog_lock_irq_disable(sd);
 	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
@@ -6001,6 +6024,9 @@ static bool flush_required(int cpu)
 		   !skb_queue_empty_lockless(&sd->process_queue);
 	backlog_unlock_irq_enable(sd);
 
+	if (!do_flush && READ_ONCE(sd->defer_list))
+		do_flush = true;
+
 	return do_flush;
 #endif
 	/* without RPS we can't safely check input_pkt_queue: during a
@@ -6298,27 +6324,6 @@ struct napi_struct *napi_by_id(unsigned int napi_id)
 	return NULL;
 }
 
-static void skb_defer_free_flush(struct softnet_data *sd)
-{
-	struct sk_buff *skb, *next;
-
-	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
-	if (!READ_ONCE(sd->defer_list))
-		return;
-
-	spin_lock(&sd->defer_lock);
-	skb = sd->defer_list;
-	sd->defer_list = NULL;
-	sd->defer_count = 0;
-	spin_unlock(&sd->defer_lock);
-
-	while (skb != NULL) {
-		next = skb->next;
-		napi_consume_skb(skb, 1);
-		skb = next;
-	}
-}
-
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 
 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)


