From: Toke Høiland-Jørgensen <toke@redhat.com>
To: Yunsheng Lin, Andrew Morton, Jesper Dangaard Brouer, Ilias Apalodimas, "David S. Miller"
Cc: Yunsheng Lin, Yonglong Liu, Mina Almasry, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Simon Horman, linux-mm@kvack.org, netdev@vger.kernel.org
Subject: Re: [RFC PATCH net-next] page_pool: Track DMA-mapped pages and unmap them when destroying the pool
References: <20250308145500.14046-1-toke@redhat.com>
X-Clacks-Overhead: GNU Terry Pratchett
Date: Mon, 10 Mar 2025 10:13:32 +0100
Message-ID: <87cyepxn7n.fsf@toke.dk>
Yunsheng Lin writes:

> On 3/8/2025 10:54 PM, Toke Høiland-Jørgensen wrote:
>> When enabling DMA mapping in page_pool, pages are kept DMA-mapped until
>> they are released from the pool, to avoid the overhead of re-mapping
>> them every time they are used. This causes problems when a device is
>> torn down, because the page pool can't unmap the pages until they are
>> returned to the pool. This leads to resource leaks and/or crashes when
>> there are pages still outstanding while the device is torn down, because
>> page_pool will attempt an unmap of a non-existent DMA device on the
>> subsequent page return.
>>
>> To fix this, implement a simple tracking of outstanding DMA-mapped pages
>> in page_pool using an xarray. This was first suggested by Mina[0], and
>> turns out to be fairly straightforward: we simply store pointers to
>> pages directly in the xarray with xa_alloc() when they are first DMA
>> mapped, and remove them from the array on unmap.
>> Then, when a page pool is torn down, it can simply walk the xarray and
>> unmap all pages still present there before returning, which also allows
>> us to get rid of the get/put_device() calls in page_pool. Using
>> xa_cmpxchg(), no additional synchronisation is needed, as a page will
>> only ever be unmapped once.
>
> The implementation of xa_cmpxchg() seems to take the xa_lock, which
> seems to be a per-xarray spin_lock.
>
> Yes, if we were to take a per-xarray lock unconditionally, additional
> synchronisation like RCU doesn't seem to be needed. But that seems like
> excessive overhead for normal packet processing when page_pool_destroy()
> has not been called yet?

I dunno, maybe? It's only done on DMA map/unmap, so it may be
acceptable? We should get some benchmark results to assess :)

> Also, we might need similar locking or synchronisation for the DMA sync
> API, in order to skip the DMA sync API when page_pool_destroy() has
> been called too.

Good point, but that seems like a separate issue? And simpler to solve
(just set pool->dma_sync to false when destroying?).

>> To avoid having to walk the entire xarray on unmap to find the page
>> reference, we stash the ID assigned by xa_alloc() into the page
>> structure itself, in the field previously called '_pp_mapping_pad' in
>> the page_pool struct inside struct page. This field overlaps with the
>> page->mapping pointer, which may turn out to be problematic, so an
>> alternative is probably needed. Sticking the ID into some of the upper
>> bits of page->pp_magic may work as an alternative, but that requires
>> further investigation. Using the 'mapping' field works well enough as
>> a demonstration for this RFC, though.
>
> page->pp_magic seems to have only 16 bits available at most when trying
> to reuse some unused bits in it, as BPF_PTR_POISON and
> STACK_DEPOT_POISON seem to already use 16 bits, so:
>
> 1. For 32-bit systems, it seems there are only 16 bits left even if
>    ILLEGAL_POINTER_VALUE is defined as 0x0.
> 2. For 64-bit systems, it seems there are only 12 bits left for powerpc,
>    as ILLEGAL_POINTER_VALUE is defined as 0x5deadbeef0000000; see
>    arch/powerpc/Kconfig.
>
> So it seems we might need to limit the DMA mapping count to 4096 or
> 65536?
>
> And to be honest, I am not sure if those 'unused' 12/16 bits can really
> be reused or not; I guess we might need suggestions from mm and arch
> experts here.

Why do we need to care about BPF_PTR_POISON and STACK_DEPOT_POISON?
AFAICT, we only need to make sure we preserve the PP_SIGNATURE value.
See v2 of the RFC patch; the bit arithmetic there gives me:

- 24 bits on 32-bit architectures
- 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
- 32 bits on other 64-bit architectures

Which seems to be plenty?

>> Since all the tracking is performed on DMA map/unmap, no additional code
>> is needed in the fast path, meaning the performance overhead of this
>> tracking is negligible. The extra memory needed to track the pages is
>> neatly encapsulated inside the xarray, which uses the 'struct xa_node'
>> structure to track items. This structure is 576 bytes long, with slots
>> for 64 items, meaning that a full node incurs only 9 bytes of overhead
>> per slot it tracks (in practice it probably won't be this efficient,
>> but in any case it should be an acceptable overhead).
>
> Even if items are stored sequentially in an xa_node at first, is it
> possible that there may be fragmentation in those xa_nodes when pages
> are released and allocated many times during packet processing? If yes,
> is there any fragmentation info about those xa_nodes?

Some (that's what I meant by "not as efficient"). AFAICT, the xarray
does do some rebalancing of the underlying radix tree, freeing nodes
when they are no longer used.
However, I am not too familiar with the xarray code, so I don't know
exactly how efficient this is in practice.

>> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/
>>
>> Fixes: ff7d6b27f894 ("page_pool: refurbish version of page_pool code")
>> Reported-by: Yonglong Liu
>> Suggested-by: Mina Almasry
>> Reviewed-by: Jesper Dangaard Brouer
>> Tested-by: Jesper Dangaard Brouer
>> Signed-off-by: Toke Høiland-Jørgensen
>> ---
>> This is an alternative to Yunsheng's series. Yunsheng requested I send
>> this as an RFC to better be able to discuss the different approaches; see
>> some initial discussion in [1], also regarding where to store the ID as
>> alluded to above.
>
> As mentioned before, I am not really convinced there is any space left
> in 'struct page' yet; otherwise we might already have used that space
> to fix the DMA address > 32 bits problem on 32-bit systems, see
> page_pool_set_dma_addr_netmem().

See above.

> Also, using more space in 'struct page' for page_pool seems to make
> page_pool more coupled to the mm subsystem, which seems to not align
> with the folio work that is trying to decouple non-mm subsystems from
> the mm subsystem by avoiding other subsystems using more of 'struct
> page' as metadata, from a long-term point of view.

This seems a bit theoretical; any future changes to struct page would
have to shuffle things around so we still have the ID available,
obviously :)

-Toke