From: Toke Høiland-Jørgensen <toke@redhat.com>
To: Yunsheng Lin, Jesper Dangaard Brouer, davem@davemloft.net, kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, fanghaiqing@huawei.com, liuyonglong@huawei.com, Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet, Ilias Apalodimas, linux-mm@kvack.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org, kernel-team
Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Thu, 31 Oct 2024 17:18:07 +0100
Message-ID: <874j4sb60w.fsf@toke.dk>
In-Reply-To: <023fdee7-dbd4-4e78-b911-a7136ff81343@huawei.com>
References: <20241022032214.3915232-1-linyunsheng@huawei.com> <20241022032214.3915232-4-linyunsheng@huawei.com> <113c9835-f170-46cf-92ba-df4ca5dfab3d@huawei.com> <878qudftsn.fsf@toke.dk> <87r084e8lc.fsf@toke.dk> <878qu7c8om.fsf@toke.dk> <1eac33ae-e8e1-4437-9403-57291ba4ced6@huawei.com> <023fdee7-dbd4-4e78-b911-a7136ff81343@huawei.com>
Yunsheng Lin writes:

> On 2024/10/30 19:57, Toke Høiland-Jørgensen wrote:
>> Yunsheng Lin writes:
>>
>>>> But, well, I'm not sure it is? You seem to be taking it as axiomatic
>>>> that the wait in itself is bad. Why? It's just a bit of memory being
>>>> held on to while it is still in use, and so what?
>>>
>>> Actually, I thought about adding some sort of timeout or kicking based
>>> on Jakub's waiting patch too.
>>>
>>> But after looking at more of the caching in the networking stack,
>>> waiting and kicking/flushing seems harder than recording the inflight
>>> pages, mainly because kicking/flushing needs every subsystem using
>>> page_pool-owned pages to provide a kicking/flushing mechanism for it to
>>> work, not to mention how much time it takes to do all the
>>> kicking/flushing.
>>
>> Eliding the details above, but yeah, you're right, there are probably
>> some pernicious details to get right if we want to flush all caches. So
>> I wouldn't do that to start with. Instead, just add the waiting to start
>> with, then wait and see if this actually turns out to be a problem in
>> practice.
>> And if it is, identify the source of that problem, deal with
>> it, rinse and repeat :)
>
> I am not sure if I have mentioned to you that Jakub had an RFC for the
> waiting, see [1]. And Yonglong (Cc'ed) tested it: the waiting caused the
> driver unload to stall forever, with some tasks hung, see [2].
>
> The root cause for the above case is skb_defer_free_flush() not being
> called, as mentioned before.

Well, let's fix that, then! We already have logic to flush backlogs when
a netdevice is going away, so AFAICT all that's needed is to add the
skb_defer_free_flush() call to that logic. Totally untested patch below,
that we should maybe consider applying in any case.

> I am not sure I understand the reasoning behind the above suggestion to
> 'wait and see if this actually turns out to be a problem' when we already
> know that there are some cases which need cache kicking/flushing for the
> waiting to work, that the kicking/flushing may not be easy and may take
> an indefinite amount of time, and that there might be other cases needing
> kicking/flushing that we don't know about yet.
>
> Is there any reason not to consider recording the inflight pages, so that
> unmapping can be done for inflight pages before the driver is unbound,
> supposing a dynamic number of inflight pages can be supported?
>
> IOW, is there any reason you and Jesper take it as axiomatic that
> recording the inflight pages is bad, supposing the number of inflight
> pages can be unlimited and the recording can be done with minimal
> performance overhead?

Well, page_pool is a memory allocator, and it already has a mechanism to
handle returning of memory to it. You're proposing to add a second,
orthogonal, mechanism to do this, one that adds both overhead and
complexity, yet doesn't handle all cases (cf your comment about devmem).
And even if it did handle all cases, force-releasing pages in this way
really feels like it's just papering over the issue.
If there are pages being leaked (or that are outstanding forever, which
basically amounts to the same thing), that is something we should be
fixing the root cause of, not just working around it like this series
does.

-Toke

Patch to flush the deferred free list when taking down a netdevice;
compile-tested only:

diff --git a/net/core/dev.c b/net/core/dev.c
index ea5fbcd133ae..6e64e24ad6fa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5955,6 +5955,27 @@ EXPORT_SYMBOL(netif_receive_skb_list);
 
 static DEFINE_PER_CPU(struct work_struct, flush_works);
 
+static void skb_defer_free_flush(struct softnet_data *sd)
+{
+	struct sk_buff *skb, *next;
+
+	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
+	if (!READ_ONCE(sd->defer_list))
+		return;
+
+	spin_lock(&sd->defer_lock);
+	skb = sd->defer_list;
+	sd->defer_list = NULL;
+	sd->defer_count = 0;
+	spin_unlock(&sd->defer_lock);
+
+	while (skb != NULL) {
+		next = skb->next;
+		napi_consume_skb(skb, 1);
+		skb = next;
+	}
+}
+
 /* Network device is going away, flush any packets still pending */
 static void flush_backlog(struct work_struct *work)
 {
@@ -5964,6 +5985,8 @@ static void flush_backlog(struct work_struct *work)
 	local_bh_disable();
 	sd = this_cpu_ptr(&softnet_data);
 
+	skb_defer_free_flush(sd);
+
 	backlog_lock_irq_disable(sd);
 	skb_queue_walk_safe(&sd->input_pkt_queue, skb, tmp) {
 		if (skb->dev->reg_state == NETREG_UNREGISTERING) {
@@ -6001,6 +6024,9 @@ static bool flush_required(int cpu)
 		 !skb_queue_empty_lockless(&sd->process_queue);
 	backlog_unlock_irq_enable(sd);
 
+	if (!do_flush && READ_ONCE(sd->defer_list))
+		do_flush = true;
+
 	return do_flush;
 #endif
 	/* without RPS we can't safely check input_pkt_queue: during a
@@ -6298,27 +6324,6 @@ struct napi_struct *napi_by_id(unsigned int napi_id)
 	return NULL;
 }
 
-static void skb_defer_free_flush(struct softnet_data *sd)
-{
-	struct sk_buff *skb, *next;
-
-	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
-	if (!READ_ONCE(sd->defer_list))
-		return;
-
-	spin_lock(&sd->defer_lock);
-	skb = sd->defer_list;
-	sd->defer_list = NULL;
-	sd->defer_count = 0;
-	spin_unlock(&sd->defer_lock);
-
-	while (skb != NULL) {
-		next = skb->next;
-		napi_consume_skb(skb, 1);
-		skb = next;
-	}
-}
-
 #if defined(CONFIG_NET_RX_BUSY_POLL)
 
 static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)