From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id C17B7C36002 for ; Thu, 27 Mar 2025 03:54:27 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 5A01B2800C3; Wed, 26 Mar 2025 23:54:26 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 54F682800A5; Wed, 26 Mar 2025 23:54:26 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 417CF2800C3; Wed, 26 Mar 2025 23:54:26 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 21B812800A5 for ; Wed, 26 Mar 2025 23:54:26 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay02.hostedemail.com (Postfix) with ESMTP id C41EF1201F3 for ; Thu, 27 Mar 2025 03:54:25 +0000 (UTC) X-FDA: 83265963690.06.FD9A685 Received: from szxga04-in.huawei.com (szxga04-in.huawei.com [45.249.212.190]) by imf07.hostedemail.com (Postfix) with ESMTP id 830A540002 for ; Thu, 27 Mar 2025 03:54:22 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf07.hostedemail.com: domain of linyunsheng@huawei.com designates 45.249.212.190 as permitted sender) smtp.mailfrom=linyunsheng@huawei.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1743047664; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=lhC8tbbz82VtzPlWLYl2Z1AK/JLDIaaX6Nb765y/Too=; b=UwpdJG/IgXkBN/YbkJ/rsBOeMLpBnADBaftNTotKI0P8OIl8nVu4ErsXzNpnu21rpfhqGd aQRF2X1FDP9JxfEIq42YiRHnbZ/t7MY1XeNBSxYgtqsr5ax5xL3t67Yn6lA1irVvyzKPQs wYjVoWJmx0kgcZUQoCKFOQ61fgNjPQI= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=none; dmarc=pass (policy=quarantine) header.from=huawei.com; spf=pass (imf07.hostedemail.com: domain of linyunsheng@huawei.com designates 45.249.212.190 as permitted sender) smtp.mailfrom=linyunsheng@huawei.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1743047664; a=rsa-sha256; cv=none; b=iV+aLpQP3Choru9qCBoO7ZKLCb4rvp/v6wBz/D2zeR79icT1S27MmoUiK+A5UNqldeh385 jqYbcPjipKOfoBwPS4hRg0WhBEvyb6g0MbP+X0R+sR/IwCpoZFBystBOTsraRJjTzBnhvN 0Yp7OEBZspkdrEYm5DeEGDcQF4vzqoQ= Received: from mail.maildlp.com (unknown [172.19.163.17]) by szxga04-in.huawei.com (SkyGuard) with ESMTP id 4ZNV7S3X2Hz2CdMH; Thu, 27 Mar 2025 11:51:00 +0800 (CST) Received: from dggpemf200006.china.huawei.com (unknown [7.185.36.61]) by mail.maildlp.com (Postfix) with ESMTPS id C8D521A0188; Thu, 27 Mar 2025 11:54:17 +0800 (CST) Received: from [10.67.120.129] (10.67.120.129) by dggpemf200006.china.huawei.com (7.185.36.61) with Microsoft SMTP Server (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id 15.2.1544.11; Thu, 27 Mar 2025 11:54:17 +0800 Message-ID: Date: Thu, 27 Mar 2025 11:53:44 +0800 MIME-Version: 1.0 User-Agent: Mozilla Thunderbird Subject: Re: [PATCH net-next v2 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool To: Saeed Mahameed , =?UTF-8?Q?Toke_H=C3=B8iland-J=C3=B8rgensen?= CC: "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer , Leon Romanovsky , Tariq Toukan , Andrew Lunn , Eric Dumazet , Paolo Abeni , Ilias Apalodimas , Simon Horman , Andrew Morton , Mina Almasry , Yonglong Liu , Pavel Begunkov , Matthew Wilcox , , , , , Qiuling Ren , Yuying Ma References: <20250325-page-pool-track-dma-v2-0-113ebc1946f3@redhat.com> <20250325-page-pool-track-dma-v2-3-113ebc1946f3@redhat.com> Content-Language: en-US From: Yunsheng Lin In-Reply-To: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: 7bit X-Originating-IP: [10.67.120.129] X-ClientProxiedBy: dggems703-chm.china.huawei.com (10.3.19.180) To dggpemf200006.china.huawei.com (7.185.36.61) X-Rspamd-Server: rspam01 X-Stat-Signature: h3c6in6de6jcwsbnfdkcom8fnb7ttcsy X-Rspam-User: X-Rspamd-Queue-Id: 830A540002 X-HE-Tag: 1743047662-827822 X-HE-Meta: U2FsdGVkX1/XscyNPvXwCevvvQ25v3iDC9rTFE58zIwi5AyrIjhT6PGwKPI3Liv1ZhPVJ7wgLGuT9LDZlhdLLN2cTgRrt530GXZnGUlE2Mqz6bAXqUtr91AjR6HK95UcXkjtCTptF7b4w2ePZzKo1UI523q1BUFtA2XG75Zhm3CIH/udZOOkiCTM5SbMvU6EatEDZY6vc66+Ti7pO7D6iYqc07tuTZLWCKfH9t2Ii5eds4FNmX/1hhW4w88P9dsC42wsJYSC67ulLex0+RSL59SwM65LEkNcgw/yXvBOOP7zD3cSt/YUJUYW7OIByG1Hs2KF/fhywUzsiK4ZO51CkY/B4xBA3TYjzm9WQkPTESjuMATrsIpsie8a4D5oNAPPdIL9mmf5fE8x5oY+nC1SDx5QTeCeCcamzcyrMDfW5XbLIX/lgTqH4dovfEkSniYeXqY9EHkKS7+YmohyL+9+SGr4Uci0OjOq7b/LfVUUD1PE+YjqA4BUyk/oX14LTqMpdnBkMO9grg+gxTzK8oNGxJc1su3KsnJW4OnEuVDIDwb/l8yYlqeBXZotwoznc6P8jss2mP3dfGcmnT+PqaQt29l98ctHos9wan3Sgru6AGFGbEXeWVN/pcR14KJ8UUVqrnX3bQ0BmykRicOGK2cBBGFUoBidg6BE/OUbMcVo9MGe081Nh6xjK0s6N49UMIsW90IyUVozlMJQlfaYwI3y96h/X0X7v/MC+/nQdiR6Wj8gsBNt3BvjMgZov6Qcs2YUcaDUPhKXZrJeBhQD2VoY7Lwo4S33bRLHnbXPEa0bitwcKAW1OFBEH9vjXoGn+iOZS3eZ3qI4Ssj5SnSzdBNIbtR4CJwDdBKiDXYMEZvqoxdH+9EPS0QKjEck46xXUOK2FRgzGznbWWW+W4TWO/NEY6gmOmLgUs+88+w/k7cCaow823j0cvkmj8JPhs833DB/kRO/2Bqg+gQnbkY53lq uxUzBYhY F67631zhkct0iMiHltvOckVuFbhTbx6MdxhPBdi9E1uXp/sxxYkKy+1mu5ASqFmsBQ4lVD7RD64oGUrTZEMDkchQvKLXSdqvGkOFchMVeqWnxCeW1sVUIN/xksmplILvKeVNRl7h4xs6pObtHZKTg+z2tNGSrh3VhkTSx+H3eLayV11yJbBlwddfnDmd33I2oonJzkJHTpN2pylJUSjTxJdobbimM87I7FLwpxaidyBs4gwvT2Flsb35A6Kl+QgEcgfurcnUvOhsrWB0yYdOKHoJLsCTpjMrKuxO7VVSMdQj1UUkfPLibiyL3istgfrN8riQFQpSWI+AWAQ26W0L8aHV3Ta1HPaRqcRPF5EfX9sc8c17wevKB89m30GLrJ5e2K4cc/V4RvGnmvjnChJnXB6LM7rNBz6l3SgzAtpFWhg+GMEVRz48gJ8pwR8qNVpLZMuTrvghM5iicsE0E+dW4Mhuht+1l0wtycJn+k8ixbxAl2Fg= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: On 2025/3/27 2:22, Saeed Mahameed wrote: ... > > I know there's an extra check on fast path, but looking below it seems > like slow path is becoming unnecessarily complicated, and we are sacrificing > slow path performance significantly for the sake of not touching fast path > at all. page_pool slow path should _not_ be significantly slower > than using the page allocator directly! In many production environments and > some workloads, page pool recycling can happen at a very low rate resulting > on performance relying solely on slow path.. +1. And we also rely on the above 'slow path' to flush old nid-mismatched pages and refill new pages from the correct nid, see the nid checking and updating in page_pool_refill_alloc_cache() and page_pool_update_nid() when irq migrates between different numa nodes as it seems common for linux distributions to use irqbalance to tune performance. >> To fix this, implement a simple tracking of outstanding DMA-mapped pages >> in page pool using an xarray. This was first suggested by Mina[0], and >> turns out to be fairly straight forward: We simply store pointers to >> pages directly in the xarray with xa_alloc() when they are first DMA >> mapped, and remove them from the array on unmap. Then, when a page pool >> is torn down, it can simply walk the xarray and unmap all pages still >> present there before returning, which also allows us to get rid of the >> get/put_device() calls in page_pool. Using xa_cmpxchg(), no additional >> synchronisation is needed, as a page will only ever be unmapped once. >> >> To avoid having to walk the entire xarray on unmap to find the page >> reference, we stash the ID assigned by xa_alloc() into the page >> structure itself, using the upper bits of the pp_magic field. This >> requires a couple of defines to avoid conflicting with the >> POINTER_POISON_DELTA define, but this is all evaluated at compile-time, >> so does not affect run-time performance. The bitmap calculations in this >> patch gives the following number of bits for different architectures: >> >> - 23 bits on 32-bit architectures >> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE) >> - 32 bits on other 64-bit architectures >> >> Stashing a value into the unused bits of pp_magic does have the effect >> that it can make the value stored there lie outside the unmappable >> range (as governed by the mmap_min_addr sysctl), for architectures that >> don't define ILLEGAL_POINTER_VALUE. This means that if one of the >> pointers that is aliased to the pp_magic field (such as page->lru.next) >> is dereferenced while the page is owned by page_pool, that could lead to >> a dereference into userspace, which is a security concern. The risk of >> this is mitigated by the fact that (a) we always clear pp_magic before >> releasing a page from page_pool, and (b) this would need a >> use-after-free bug for struct page, which can have many other risks >> since page->lru.next is used as a generic list pointer in multiple >> places in the kernel. As such, with this patch we take the position that >> this risk is negligible in practice. For more discussion, see[1]. The below is a recent UAF for a page, I guess it can be said that this patch seem to basically make mmap_min_addr mechanism useless for the above arches when a driver with page_pool dma mapping support is loaded: https://lore.kernel.org/all/Z878K7JQ93LqBdCB@casper.infradead.org/ So how about logging a warning so that user can tell if their system may be in security compromised state for the above case? >> >> Since all the tracking added in this patch is performed on DMA >> map/unmap, no additional code is needed in the fast path, meaning the >> performance overhead of this tracking is negligible there. A >> micro-benchmark shows that the total overhead of the tracking itself is >> about 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[2]). >> Since this cost is only paid on DMA map and unmap, it seems like an >> acceptable cost to fix the late unmap issue. Further optimisation can >> narrow the cases where this cost is paid (for instance by eliding the >> tracking when DMA map/unmap is a no-op). >> > What I am missing here, what is the added cost of those extra operations on > the slow path compared to before this patch? Total overhead being > acceptable doesn't justify the change, we need diff before and after. Toke used my data in [2] below: The above 400ns is the added cost of those extra operations on the slow path, before this patch the slow path only cost about 170ns, so there is more than 200% performance degradation for the page tracking in this patch, which I failed to see why it is acceptable:( > >> The extra memory needed to track the pages is neatly encapsulated inside >> xarray, which uses the 'struct xa_node' structure to track items. This >> structure is 576 bytes long, with slots for 64 items, meaning that a >> full node occurs only 9 bytes of overhead per slot it tracks (in >> practice, it probably won't be this efficient, but in any case it should >> be an acceptable overhead). >> >> [0] https://lore.kernel.org/all/CAHS8izPg7B5DwKfSuzz-iOop_YRbk3Sd6Y4rX7KBG9DcVJcyWg@mail.gmail.com/ >> [1] https://lore.kernel.org/r/20250320023202.GA25514@openwall.com >> [2] https://lore.kernel.org/r/ae07144c-9295-4c9d-a400-153bb689fe9e@huawei.com