From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mina Almasry <almasrymina@google.com>
Date: Sat, 8 Mar 2025 11:22:16 -0800
Subject: Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and
 unmap them when destroying the pool
To: Toke Høiland-Jørgensen, Pavel Begunkov, David Wei
Cc: Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas,
 "David S. Miller", Yunsheng Lin, Yonglong Liu, Eric Dumazet,
 Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm@kvack.org,
 netdev@vger.kernel.org
References: <20250308145500.14046-1-toke@redhat.com>
In-Reply-To: <20250308145500.14046-1-toke@redhat.com>

Miller" , Yunsheng Lin , Yonglong Liu , Eric Dumazet , Jakub Kicinski , Paolo Abeni , Simon Horman , linux-mm@kvack.org, netdev@vger.kernel.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: A0448A0006 X-Stat-Signature: j4nbakheth1z6gyq8gj4map57tr9e3tk X-Rspamd-Server: rspam02 X-Rspam-User: X-HE-Tag: 1741461751-585291 X-HE-Meta: U2FsdGVkX19rvGeoagh5qOmwaNLUO/63TxiqzxGOxoM/VhfFoI4XiEpEllclbYi/l350KEgn0QRmviYPzbnIijRMq0JKaS3mfCAfho2yqcvHb9bcEGqoTLM23T17R2z83eQFzEji/q75MYo+10JKRxZYVBwBR9V5TJ0jeOE109ePARAtgZHCI37huGh1R45u4/vvUkLxppaMPYBpwR2sC2absdBrPOintn9UVmio3lYjr6/yK0F13KrZLngah9ymlgupjRBXoC9PU1w916nBnpbvuxwRx1dHsaQHOsPlV33hPqZ15Bu1Jbz/KvJUQh4JEv6gqC8Zd9gy5xvXN94AodkvmAZJXL1OvFbItuOjoh7F16FgLiiK90ZqSI3H7OzbmoX0Nbh80CEXLj7D2I8BpD5G2jI0/xclY8+kK8QiqKoXtZVT+AyL6aCBzeIbe5gkzIvcj7u0NcaoBxqXpslP+yk/SYYKj6NGPGhgEQghKR/qfyK2fPS+3Kh5S75+XfUoMAxGAYn1kvnrftXoTrX10TfLT+nz9gXUYIrxppfJEcfZk9a58/arwLgzsOOpmu8P4xP6bQLdpbgrMZjrniDSKJpIn0++hsvGjazxSZpDeoSpe7dqimrCHnM89RQVo6csYkUvO9cYtQ4fQ9hs6ke3h+CTUIo6BBP7Et1HMEbXLP4PluSDyKN9X2Se3lQ/YH76FHH1EaXLfz+pelT/hR8pirXKJvzolBqAUt+R76ZfWx7liz8W3LXdTPBWlb3qiSRuPgcOWwYZGViS6HWljDK/fUc1lAbLfTFS+tp6Lc3ENF/Pru63eu5hiRgU446KYRM4VpfSaZ9VFrHNKAKiKxnuiEaqt4Phx3DGgn5fZFhYMEqfM2noy+l9kqrQvK240Bw/XJ+blNIiMMEg9fkulKIFmqAm8hkZBr1cDUPOt+VbOB5frqeDZL89eEo5PTjcjR7gml5FvFEXFkyMVsQO5+J BZ/Bh2Tk mADaiJvYHbo/fB6rZHJ1FxeIGlldTItbzLnkMPnwUjWraWszmBj6cFVVv6sXN+2SlUkSEbbfjqTzbX77gfsdfwD+CESGP2N4funz/fMVDwoyKEpT0VYe38AnVwPKkcFwLSHP76gMI6wlFLdvIe7MglCWud+96snU0Atn2ZV5EqyWHDDlornZasJEZro9DJTFROcOyDI03eo1JBMzn1k1kIPJ2nt8e3DkeP9MXbI0e63440/2+nEH4DQwYddtI8LTNjD+n96taw6jEL2DWFdGd7YZedYBbnPQ/Q4z2A2EEPfFCkwl4c5jrUWhMhSNi8rk0pcls5eO3Tt2BuIZ1tl+l3Svjgc452kN8QvxjjzsxR0gWE2VH03vn1g5OjqXUA7Lu9MZ0fLFR2ui/wgJYJ+beTXxURp17LRJ9wIpv8w9wMiS+ukhisGOGa2EJXA== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On Sat, Mar 8, 2025 at 6:55=E2=80=AFAM Toke H=C3=B8iland-J=C3=B8rgensen wrote: > > When enabling DMA mapping in page_pool, pages are kept DMA mapped until > they are released from the pool, to avoid the overhead of re-mapping the > pages every time they are used. This causes problems when a device is > torn down, because the page pool can't unmap the pages until they are > returned to the pool. This causes resource leaks and/or crashes when > there are pages still outstanding while the device is torn down, because > page_pool will attempt an unmap of a non-existent DMA device on the > subsequent page return. > > To fix this, implement a simple tracking of outstanding dma-mapped pages > in page pool using an xarray. This was first suggested by Mina[0], and > turns out to be fairly straight forward: We simply store pointers to > pages directly in the xarray with xa_alloc() when they are first DMA > mapped, and remove them from the array on unmap. Then, when a page pool > is torn down, it can simply walk the xarray and unmap all pages still > present there before returning, which also allows us to get rid of the > get/put_device() calls in page_pool. THANK YOU!! I had been looking at the other proposals to fix this here and there and I had similar feelings to you. They add lots of code changes and the code changes themselves were hard for me to understand. I hope we can make this simpler approach work. > Using xa_cmpxchg(), no additional > synchronisation is needed, as a page will only ever be unmapped once. 
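Just to check my reading of the life cycle, here is a minimal sketch of
the pattern as I understand it (hypothetical helper names, error paths
elided -- the actual patch is further down):

/*
 * Sketch only -- hypothetical helpers, not the actual patch. Assumes
 * pool->dma_mapped was set up with xa_init_flags(..., XA_FLAGS_ALLOC1).
 */
#include <linux/dma-mapping.h>
#include <linux/xarray.h>
#include <net/page_pool/helpers.h>

/* Map path: track the page and stash the allocated ID in struct page. */
static bool pp_track_page(struct page_pool *pool, struct page *page, gfp_t gfp)
{
	u32 id;

	if (xa_alloc(&pool->dma_mapped, &id, page, XA_LIMIT(1, UINT_MAX), gfp))
		return false;
	page->pp_dma_index = id;
	return true;
}

/*
 * Unmap path: whoever wins the xa_cmpxchg() does the single unmap,
 * whether that is the normal release path or the pool-destroy sweep.
 */
static void pp_unmap_page(struct page_pool *pool, struct page *page)
{
	unsigned long id = page->pp_dma_index;

	if (xa_cmpxchg(&pool->dma_mapped, id, page, NULL, 0) != page)
		return; /* the other path already unmapped this page */

	dma_unmap_page_attrs(pool->p.dev, page_pool_get_dma_addr(page),
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC);
}

/* Destroy path: unmap whatever is still outstanding before the device goes. */
static void pp_unmap_all(struct page_pool *pool)
{
	struct page *page;
	unsigned long id;

	xa_for_each(&pool->dma_mapped, id, page)
		pp_unmap_page(pool, page);
}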
> To avoid having to walk the entire xarray on unmap to find the page
> reference, we stash the ID assigned by xa_alloc() into the page
> structure itself, in the field previously called '_pp_mapping_pad' in
> the page_pool struct inside struct page. This field overlaps with the
> page->mapping pointer, which may turn out to be problematic, so an
> alternative is probably needed. Sticking the ID into some of the upper
> bits of page->pp_magic may work as an alternative, but that requires
> further investigation. Using the 'mapping' field works well enough as
> a demonstration for this RFC, though.
>

I'm unsure about this. I think page->mapping may be used when we map
the page to userspace in TCP zerocopy, but I'm really not sure. Yes,
finding somewhere else to put the id would be ideal. Do we really need
a full unsigned long for the pp_magic?

> Since all the tracking is performed on DMA map/unmap, no additional code
> is needed in the fast path, meaning the performance overhead of this
> tracking is negligible. The extra memory needed to track the pages is
> neatly encapsulated inside xarray, which uses the 'struct xa_node'
> structure to track items. This structure is 576 bytes long, with slots
> for 64 items, meaning that a full node incurs only 9 bytes of overhead
> per slot it tracks (in practice, it probably won't be this efficient,
> but in any case it should be an acceptable overhead).
>

Yes, I think I also saw in another thread that this version actually
produces better perf results than the more complicated one, because
the page_pool benchmark does no mapping at all, and this version adds
no overhead when there is no mapping.

As an aside, I have it on my todo list to put the page_pool benchmark
into the upstream kernel so we don't have to run out-of-tree versions.

> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>
> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
> Reported-by: Yonglong Liu
> Suggested-by: Mina Almasry
> Reviewed-by: Jesper Dangaard Brouer
> Tested-by: Jesper Dangaard Brouer
> Signed-off-by: Toke Høiland-Jørgensen

I'm only holding back my Reviewed-by for the page->mapping issue.
Other than that, I think this is great. Thank you very much for
looking into it.

Pavel, David, as an aside, I think we need to propagate this fix to
the memory providers as a follow-up. We probably need a new op in the
provider to unmap. Then, in page_pool_scrub, where this patch does an
xa_for_each, we need to call that unmap op. For the io_uring memory
provider, I think it needs to unmap its memory (and record that it did
so, so it doesn't re-unmap it later). For the dmabuf memory provider,
I think we may need to call dma_buf_unmap_attachment (and record that
we did, so we don't re-do it later).

I don't think propagating this fix to the memory providers should
block merging the fix for pages, IMO.
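Something along these lines is what I have in mind -- entirely
hypothetical, op name and signature included, just to make the
suggestion concrete:

/* Hypothetical sketch -- no such op exists today. */
struct memory_provider_ops {
	/* ...existing ops... */

	/*
	 * Unmap one still-outstanding netmem at pool teardown, and
	 * record that the unmap happened so a later release of the
	 * same netmem doesn't unmap it a second time.
	 */
	void (*unmap_netmem)(struct page_pool *pool, netmem_ref netmem);
};

/* page_pool_scrub() would then invoke it from the xa_for_each walk: */
static void page_pool_scrub_unmap(struct page_pool *pool, netmem_ref netmem)
{
	if (pool->mp_ops && pool->mp_ops->unmap_netmem)
		pool->mp_ops->unmap_netmem(pool, netmem);
}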
> ---
> This is an alternative to Yunsheng's series. Yunsheng requested I send
> this as an RFC to better be able to discuss the different approaches;
> see some initial discussion in [1], also regarding where to store the
> ID as alluded to above.
>
> -Toke
>
> [1] https://lore.kernel.org/r/40b33879-509a-4c4a-873b-b5d3573b6e14@gmail.com
>
>  include/linux/mm_types.h      |  2 +-
>  include/net/page_pool/types.h |  2 ++
>  net/core/netmem_priv.h        | 17 +++++++++++++
>  net/core/page_pool.c          | 46 +++++++++++++++++++++++++++++------
>  4 files changed, 58 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0234f14f2aa6..d2c7a7b04bea 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -121,7 +121,7 @@ struct page {
>                          */
>                         unsigned long pp_magic;
>                         struct page_pool *pp;
> -                       unsigned long _pp_mapping_pad;
> +                       unsigned long pp_dma_index;
>                         unsigned long dma_addr;
>                         atomic_long_t pp_ref_count;
>                 };
> diff --git a/include/net/page_pool/types.h b/include/net/page_pool/types.h
> index 36eb57d73abc..13597a77aa36 100644
> --- a/include/net/page_pool/types.h
> +++ b/include/net/page_pool/types.h
> @@ -221,6 +221,8 @@ struct page_pool {
>         void *mp_priv;
>         const struct memory_provider_ops *mp_ops;
>
> +       struct xarray dma_mapped;
> +
>  #ifdef CONFIG_PAGE_POOL_STATS
>         /* recycle stats are per-cpu to avoid locking */
>         struct page_pool_recycle_stats __percpu *recycle_stats;
> diff --git a/net/core/netmem_priv.h b/net/core/netmem_priv.h
> index 7eadb8393e00..59679406a7b7 100644
> --- a/net/core/netmem_priv.h
> +++ b/net/core/netmem_priv.h
> @@ -28,4 +28,21 @@ static inline void netmem_set_dma_addr(netmem_ref netmem,
>  {
>         __netmem_clear_lsb(netmem)->dma_addr = dma_addr;
>  }
> +
> +static inline unsigned long netmem_get_dma_index(netmem_ref netmem)
> +{
> +       if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
> +               return 0;
> +
> +       return netmem_to_page(netmem)->pp_dma_index;
> +}
> +
> +static inline void netmem_set_dma_index(netmem_ref netmem,
> +                                       unsigned long id)
> +{
> +       if (WARN_ON_ONCE(netmem_is_net_iov(netmem)))
> +               return;
> +
> +       netmem_to_page(netmem)->pp_dma_index = id;
> +}
>  #endif
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index acef1fcd8ddc..d5530f29bf62 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -226,6 +226,8 @@ static int page_pool_init(struct page_pool *pool,
>                         return -EINVAL;
>
>                 pool->dma_map = true;
> +
> +               xa_init_flags(&pool->dma_mapped, XA_FLAGS_ALLOC1);
>         }
>
>         if (pool->slow.flags & PP_FLAG_DMA_SYNC_DEV) {
> @@ -275,9 +277,6 @@ static int page_pool_init(struct page_pool *pool,
>         /* Driver calling page_pool_create() also call page_pool_destroy() */
>         refcount_set(&pool->user_cnt, 1);
>
> -       if (pool->dma_map)
> -               get_device(pool->p.dev);
> -
>         if (pool->slow.flags & PP_FLAG_ALLOW_UNREADABLE_NETMEM) {
>                 /* We rely on rtnl_lock()ing to make sure netdev_rx_queue
>                  * configuration doesn't change while we're initializing
> @@ -325,7 +324,7 @@ static void page_pool_uninit(struct page_pool *pool)
>         ptr_ring_cleanup(&pool->ring, NULL);
>
>         if (pool->dma_map)
> -               put_device(pool->p.dev);
> +               xa_destroy(&pool->dma_mapped);
>
>  #ifdef CONFIG_PAGE_POOL_STATS
>         if (!pool->system)
> @@ -470,9 +469,11 @@ page_pool_dma_sync_for_device(const struct page_pool *pool,
>                 __page_pool_dma_sync_for_device(pool, netmem, dma_sync_size);
>  }
>
> -static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
> +static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, gfp_t gfp)
>  {
>         dma_addr_t dma;
> +       int err;
> +       u32 id;
>
>         /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
>          * since dma_addr_t can be either 32 or 64 bits and does not always fit
> @@ -486,9 +487,19 @@ static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem)
>         if (dma_mapping_error(pool->p.dev, dma))
>                 return false;
>
> +       if (in_softirq())
> +               err = xa_alloc(&pool->dma_mapped, &id, netmem_to_page(netmem),
> +                              XA_LIMIT(1, UINT_MAX), gfp);
> +       else
> +               err = xa_alloc_bh(&pool->dma_mapped, &id, netmem_to_page(netmem),
> +                                 XA_LIMIT(1, UINT_MAX), gfp);

nit: I think xa_limit_32b works here, because XA_FLAGS_ALLOC1 avoids 0
anyway.

--
Thanks,
Mina