Subject: Re: [PATCH net-next v3 3/3] page_pool: fix IOMMU crash when driver has already unbound
Date: Tue, 5 Nov 2024 21:11:57 +0100
From: Jesper Dangaard Brouer
To: Yunsheng Lin, Toke Høiland-Jørgensen, davem@davemloft.net,
 kuba@kernel.org, pabeni@redhat.com
Cc: zhangkun09@huawei.com, fanghaiqing@huawei.com, liuyonglong@huawei.com,
 Robin Murphy, Alexander Duyck, IOMMU, Andrew Morton, Eric Dumazet,
 Ilias Apalodimas, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 netdev@vger.kernel.org, kernel-team

On 01/11/2024 12.11, Yunsheng Lin wrote:
> On 2024/11/1 0:18, Toke Høiland-Jørgensen wrote:
>
> ...
>
>>>>
>>>> Eliding the details above, but yeah, you're right, there are probably
>>>> some pernicious details to get right if we want to flush all caches. So
>>>> I wouldn't do that to start with. Instead, just add the waiting to
>>>> start with, then wait and see if this actually turns out to be a
>>>> problem in practice.
>>>> And if it is, identify the source of that problem, deal with it,
>>>> rinse and repeat :)
>>>
>>> I am not sure if I have mentioned to you that Jakub had an RFC for the
>>> waiting, see [1]. And Yonglong (Cc'ed) had tested it: the waiting
>>> caused driver unload to stall forever and some tasks to hang, see [2].
>>>
>>> The root cause for the above case is skb_defer_free_flush() not being
>>> called, as mentioned before.
>>
>> Well, let's fix that, then! We already have logic to flush backlogs
>> when a netdevice is going away, so AFAICT all that's needed is to add
>> the
>
> Is there a possibility that a page_pool-owned page might still be
> handled/cached somewhere in the networking stack if netif_rx_internal()
> has already been called for the corresponding skb, and
> skb_attempt_defer_free() is called after the skb_defer_free_flush()
> added in the patch below has run?
>
> Maybe add a timeout, like a timer, to call kick_defer_list_purge() if
> you treat 'outstanding forever' as leaked? I actually thought about
> this, but have not found an elegant way to add the timeout.
>
>> skb_defer_free_flush() to that logic. Totally untested patch below,
>> that we should maybe consider applying in any case.
>
> I am not sure about that because of the timing window mentioned above,
> but it does seem we might need to do something similar in
> dev_cpu_dead().
>
>>
>>> I am not sure I understand the reasoning behind the above suggestion
>>> to 'wait and see if this actually turns out to be a problem' when we
>>> already know that there are some cases which need cache
>>> kicking/flushing for the waiting to work, that the kicking/flushing
>>> may not be easy and may take indefinite time too, not to mention that
>>> there might be other cases needing kicking/flushing that we don't know
>>> about yet.
>>>
>>> Is there any reason not to consider recording the inflight pages so
>>> that unmapping can be done for inflight pages before the driver is
>>> unbound, supposing a dynamic number of inflight pages can be
>>> supported?
>>>
>>> IOW, is there any reason you and Jesper take it as axiomatic that
>>> recording the inflight pages is bad, supposing the inflight pages can
>>> be unlimited and the recording can be done with minimal performance
>>> overhead?
>>
>> Well, page pool is a memory allocator, and it already has a mechanism
>> to handle returning of memory to it. You're proposing to add a second,
>> orthogonal, mechanism to do this, one that adds both overhead and
>
> I would call it a replacement/improvement for the old one rather than
> 'a second, orthogonal' mechanism, as the old one doesn't really exist
> after this patch.
>

Yes, you are proposing a very radical change to the page_pool design, and
it is getting proposed as a fix patch for an IOMMU crash. It is a very
radical change that page_pool needs to keep track of *ALL* in-flight
pages.

The DMA issue is a life-time issue of the DMA object associated with the
struct device. Then why are you not looking at extending the life-time of
the DMA object, or at least detecting when the DMA object goes away, such
that we can change a setting in page_pool to stop calling DMA unmap for
the in-flight pages once they get returned (which we have an existing
mechanism for)? Rough sketch of that idea below.
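Completely hypothetical and untested sketch, just to make the idea
concrete. The pool->dma_unmap_disabled field is invented for
illustration, and the helper name is approximate, not an actual posted
patch:

/* Hypothetical: when we detect the struct device's DMA object is going
 * away (e.g. from an unbind notifier), set the invented flag
 * pool->dma_unmap_disabled. In-flight pages still come back via the
 * normal release path, but we skip touching the dead mapping.
 */
static void page_pool_release_page_dma(struct page_pool *pool,
				       struct page *page)
{
	dma_addr_t dma;

	/* Pool never DMA-mapped its pages, or unmapping was disabled
	 * because the device is being unbound.
	 */
	if (!pool->dma_map || pool->dma_unmap_disabled)
		return;

	dma = page_pool_get_dma_addr(page);
	dma_unmap_page_attrs(pool->p.dev, dma,
			     PAGE_SIZE << pool->p.order, pool->p.dma_dir,
			     DMA_ATTR_SKIP_CPU_SYNC |
			     DMA_ATTR_WEAK_ORDERING);
	page_pool_set_dma_addr(page, 0);
}

The hard part is of course the detection, and the ordering of the flag
update against concurrent page returns, but note that this reuses the
existing return mechanism instead of tracking every in-flight page.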
>> complexity, yet doesn't handle all cases (cf. your comment about
>> devmem).
>
> I am not sure yet whether the unmapping only needs to be done using
> devmem's own version of the DMA API, but it seems the waiting might also
> need its own version of kicking/flushing for devmem, as devmem might be
> held from user space?
>
>>
>> And even if it did handle all cases, force-releasing pages in this way
>> really feels like it's just papering over the issue. If there are pages
>> being leaked (or that are outstanding forever, which basically amounts
>> to the same thing), that is something we should be fixing the root
>> cause of, not just working around it like this series does.
>
> If there were a definite time for the waiting, I would probably agree
> with the above. From the previous discussion, it seems the time to do
> the kicking/flushing would be indefinite, depending on how much cache
> has to be scanned/flushed.
>
> For the 'papering over' part, it seems to be a question of whether we
> want to paper over the different kicking/flushing, or paper over the
> unmapping using a different DMA API.
>
> Also, page_pool is not really an allocator; instead it is more like a
> pool on top of a different allocator, such as the buddy allocator or
> the devmem allocator. I am not sure it makes much sense to do the
> flushing when page_pool_destroy() is called if the buddy allocator
> behind the page_pool is not under memory pressure yet.
>

I still see page_pool as an allocator, like the SLUB/SLAB allocators:
slab allocators are created (and can be destroyed again), and we can
allocate slab objects from them. SLAB allocators also use the buddy
allocator as their backing allocator. The page_pool is of course evolving
with the addition of the devmem allocator as a different "backing"
allocator type.

> For the 'leaked' part mentioned above, I agree that it should be fixed
> if we have a clear and unified definition of 'leaked'. For example, is
> it allowed to keep the cache outstanding forever if the allocator is
> not under memory pressure and does not ask for its memory back?
>

It seems wrong to me if page_pool needs to dictate how long the API users
are allowed to hold the page.

> Doesn't it make more sense to use something like the shrinker_register()
> mechanism to decide whether to do the flushing?
>
> IOW, maybe it makes more sense that the allocator behind the page_pool
> should be deciding whether to do the kicking/flushing, and maybe
> page_pool should also use the shrinker_register() mechanism to empty its
> cache when necessary, instead of deciding whether to do the
> kicking/flushing itself.
>

Sure, I've argued before that page_pool should use shrinker_register(),
but only when used with the normal buddy allocator. BUT you need to
realize that bad things can happen when the network stack fails to
allocate memory for packets, because that can block connections from
making forward progress, and those connections can be holding on to
memory (which is part of the memory pressure problem).

> So I am not even sure it is appropriate to do the cache kicking/flushing
> during the waiting, not to mention the indefinite time needed to do the
> kicking/flushing.

--Jesper