Miller" , Jakub Kicinski , Jesper Dangaard Brouer , Saeed Mahameed , Leon Romanovsky , Tariq Toukan , Andrew Lunn , Eric Dumazet , Paolo Abeni , Ilias Apalodimas , Simon Horman , Andrew Morton , Mina Almasry , Yonglong Liu , Pavel Begunkov , Matthew Wilcox , Robin Murphy , IOMMU , segoon@openwall.com, solar@openwall.com Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Qiuling Ren , Yuying Ma Subject: Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool In-Reply-To: <7a76908d-5be2-43f1-a8e2-03b104165a29@huawei.com> References: <20250314-page-pool-track-dma-v1-0-c212e57a74c2@redhat.com> <20250314-page-pool-track-dma-v1-3-c212e57a74c2@redhat.com> <87jz8nhelh.fsf@toke.dk> <7a76908d-5be2-43f1-a8e2-03b104165a29@huawei.com> X-Clacks-Overhead: GNU Terry Pratchett Date: Tue, 18 Mar 2025 21:55:20 +0100 Message-ID: <87wmcmhxdz.fsf@toke.dk> MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: 3WPMvrtj1IhOdOXJBnmSBQ9Yg6QijFyGiDzMES8FJ_I_1742331323 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Stat-Signature: jcpy3bizcrhgjynejdosu38ou9feh8e3 X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 298C7A0014 X-Rspam-User: X-HE-Tag: 1742331329-303595 X-HE-Meta: U2FsdGVkX1+fKbNfc58uXUSYuKc/WIL45xuKOdiSIi8QUt0BpQa1eUi+cILh2jmP6ai++0+Gush/VBtBcFnRqnFfOkif+TyPb5wuQCEMoRBtztiM4p/vE12ewnVxj1MTgkIQ8X+UncLxSB8hTlVKtZdzXzaQq+Qutt7DtAcFMAUpdnyKFvu/umiI7m1G9YuX0rqNkduvf08u6Pem1OPrBmzcdCOSQHFee2wvtziMb0nu+0IWnlOWtjuS9EhCyayv0xs966IlnxEWqv/i1D1q2qgWCw6RtDXCkTijlioKPgLymdFvhsNsJigDP3WxfFwOF7/7WJE8p+MFWkRXLAwxTGD4e+jOVNOgyEHqg7hmS1xTCbXptutdOeMX+NZhE1gwkmLOssZu4nDiwNG69NFdI7M+Xtqh7EJDaGN8NoB+/WtiirB08keLNHbinGJvbUreg28SeaB02II0KKvcQ34maeZdIvQy0RnabMhEV5ntOcYM63VSEEJbQQ8Sf2Yb0SVNEIDapedBG777dDi+LyKmpDerxurifyetO1YIfhYxE8gTTDY8tRNj0mbY/LBGnVdbmZ6zw8aahhrnkfr5NislBcaSr8irAsywzXISel53RvHRFZCJgOk45T8OEf3znGnwiwg8x4FcnLMsjroeQmglcetIeQ3v1sndkdb9drGK07fTq/tupG7HzJDJIzNlUaqxsLobxDcDhg8ZFe/8Gg8uF0ixbEMymKuOcay5gR7FDWP8MATyIqqemkqSKT/BqI8omItq1wXyWnLZbNdIBIiIDRFVGFlIQDnmIo2nd38/3wFYYS468dj36tCAiT8AVcmuIT78OyC6ZDhC0g24ZGq3JZW6l3tVvyMBbCWSH0A5CQHnvqPeCtvR+DIOBjyA00AJkRee1XXkAMRd9Opd1KtOJAvjoC82pyVtiuRE3eEaIPKfZ5thgLtd/tDxq7amVmVcuBtG5w2N1nKE3fcrmLQ rb0Yzz91 70CMwRVVVtdhSuocF3r1QAiYJ4xhj/STocHWlJ/shHhA8fZe42TUTawsSMRymKJ/31YisEL0FtFzffVcb+/Q+gcu8KpKnrUK+8AT8MNZPK7qcOqQ6l+G6A9HQAeGvT3uOGkWKFhqGHuSYhVCa7MWVOM4tJInAKQuxy3Iduryy4RiGJuG56jxgGp5E4s/6vVKEWYxkpEEJCmxGYsZPfwwopsJ3yL/dhIt8NTllyDqhnvHVyQhtaOdPj2IOO9cdI7JV6eEMaJ6qcsjwLmyduzh+aykBen3d3VC5JxSqlHmYXUBmsR/aELp/UTP1OhFcWqa7ueV0vv9OkTyGyafoQp9o5EZBj5c2PUKZNlXm3Uc8APzmALb/izOSrrYdMebnPnEgTvqi+iopE8+XU5mJ99veiRuHP8MtRDTPtA2d3Wtorka3QryZVtpzcNxBbt5cdBUedgJ20cu9V/uUX80zs1h2ofo7Z6+O8w77AzFtLhspky3JGizt2B94VSUnF5kMHUyvKEucmw7s9y5L1pHnCPfGXkd96CJNGJI5NFz2IaTO1rOnAAYPkTiUA95QBAIO1CIqiPjr6OQ5MZo4tFoIcSnCpaUZ78FWpsZDk6rQz+ln+hpG9TmiWMErBlplrQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Yunsheng Lin writes: > On 2025/3/17 23:16, Toke H=C3=B8iland-J=C3=B8rgensen wrote: >> Yunsheng Lin writes: >>=20 >>> On 3/14/2025 6:10 PM, Toke H=C3=B8iland-J=C3=B8rgensen wrote: >>> >>> ... 
>>>
>>>>
>>>> To avoid having to walk the entire xarray on unmap to find the page
>>>> reference, we stash the ID assigned by xa_alloc() into the page
>>>> structure itself, using the upper bits of the pp_magic field. This
>>>> requires a couple of defines to avoid conflicting with the
>>>> POISON_POINTER_DELTA define, but this is all evaluated at
>>>> compile-time, so it does not affect run-time performance. The bitmap
>>>> calculations in this patch give the following number of bits for
>>>> different architectures:
>>>>
>>>> - 24 bits on 32-bit architectures
>>>> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE)
>>>> - 32 bits on other 64-bit architectures
>>>
>>> From commit c07aea3ef4d4 ("mm: add a signature in struct page"):
>>> "The page->signature field is aliased to page->lru.next and
>>> page->compound_head, but it can't be set by mistake because the
>>> signature value is a bad pointer, and can't trigger a false positive
>>> in PageTail() because the last bit is 0."
>>>
>>> And commit 8a5e5e02fc83 ("include/linux/poison.h: fix
>>> LIST_POISON{1,2} offset"):
>>> "Poison pointer values should be small enough to find a room in
>>> non-mmap'able/hardly-mmap'able space."
>>>
>>> So the question seems to be:
>>> 1. Does stashing the ID cause page->pp_magic to be in the mmap'able/
>>> easier-mmap'able space? If yes, how can we make sure this will not
>>> cause any security problem?
>>> 2. Does masking page->pp_magic cause a valid pointer for
>>> page->lru.next or page->compound_head to be treated as a valid
>>> PP_SIGNATURE, which might cause page_pool to recycle a page not
>>> allocated via page_pool?
>>
>> Right, so my reasoning for why the defines in this patch work for this
>> is as follows: in both cases we need to make sure that the ID stashed
>> in that field never looks like a valid kernel pointer. For 64-bit
>> arches (where CONFIG_ILLEGAL_POINTER_VALUE is defined), we make sure
>> of this by never writing to any bits that overlap with the illegal
>> value (so that the PP_SIGNATURE written to the field keeps it an
>> illegal pointer value). For 32-bit arches, we make sure of this by
>> keeping the top-most bit always 0 (the -1 in the define for
>> _PP_DMA_INDEX_BITS in the patch), which puts it outside the range
>> used for kernel pointers (AFAICT).
>
> Is there any reason you think only kernel pointers are relevant here?

Yes. Any pointer stored in the same space as pp_magic by other users of
the page will be a kernel pointer (as it comes from page->lru.next).
The goal of PP_SIGNATURE is to be able to distinguish pages allocated
by page_pool, so we don't accidentally recycle a page from somewhere
else. That's the goal of the check in page_pool_page_is_pp():

(page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE

To achieve this, we must ensure that the check above never returns true
for any value another page user could have written into the same field
(i.e., into page->lru.next). For 64-bit arches, POISON_POINTER_DELTA
serves this purpose. For 32-bit arches, we can leave the top-most bits
out of PP_MAGIC_MASK, to make sure that any valid pointer value will
fail the check above.
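
In other words, the whole thing boils down to something like the sketch
below. To be clear, this is only an illustration of the idea (showing
the 32-bit case, where POISON_POINTER_DELTA is 0 and PP_SIGNATURE is
just 0x40, and using a made-up _PP_DMA_INDEX_SHIFT name); in the patch
itself the shift and width are computed at compile time from the poison
values:

  /* Illustrative sketch only, not the literal defines from the patch */
  #define _PP_DMA_INDEX_SHIFT	7	/* first bit above PP_SIGNATURE */
  #define _PP_DMA_INDEX_BITS	24	/* keeps the top-most bit zero */
  #define PP_DMA_INDEX_MASK \
  	(((1UL << _PP_DMA_INDEX_BITS) - 1) << _PP_DMA_INDEX_SHIFT)
  /* mm uses the bottom two bits (pfmemalloc / compound) */
  #define PP_MAGIC_MASK	(~(PP_DMA_INDEX_MASK | 0x3UL))

  static inline bool page_pool_page_is_pp(struct page *page)
  {
  	/* Anything else stored in this field is a kernel pointer
  	 * (page->lru.next), which must never pass this check.
  	 */
  	return (page->pp_magic & PP_MAGIC_MASK) == PP_SIGNATURE;
  }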
> It seems it is not really only about kernel pointers, as
> round_hint_to_min() in mm/mmap.c suggests, and as the commit log of
> the above commit 8a5e5e02fc83 notes, if I understand it correctly:
> "Given unprivileged users cannot mmap anything below mmap_min_addr, it
> should be safe to use poison pointers lower than mmap_min_addr."
>
> And the above "making sure the top-most bit is always 0" doesn't seem
> to ensure page->signature is outside the range used for kernel
> pointers for 32-bit arches with VMSPLIT_1G defined, see
> arch/arm/Kconfig (there is a similar config for x86 too):
>
>   prompt "Memory split"
>   depends on MMU
>   default VMSPLIT_3G
>   help
>     Select the desired split between kernel and user memory.
>
>     If you are not absolutely sure what you are doing, leave this
>     option alone!
>
>   config VMSPLIT_3G
>     bool "3G/1G user/kernel split"
>   config VMSPLIT_3G_OPT
>     depends on !ARM_LPAE
>     bool "3G/1G user/kernel split (for full 1G low memory)"
>   config VMSPLIT_2G
>     bool "2G/2G user/kernel split"
>   config VMSPLIT_1G
>     bool "1G/3G user/kernel split"

Ah, interesting, didn't know this was configurable. Okay, so AFAICT the
lowest value of PAGE_OFFSET is 0x40000000 (for VMSPLIT_1G), so we need
to leave two bits off at the top instead of just one. Will update this,
and try to explain the logic better in the comment.
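
Concretely, I'm thinking of something like this on top of the sketch
above (untested):

  /* untested: the lowest configurable PAGE_OFFSET is 0x40000000
   * (VMSPLIT_1G), so keep the top *two* bits of pp_magic clear so the
   * stashed ID can never alias a valid 32-bit kernel pointer
   */
  #define _PP_DMA_INDEX_BITS	(BITS_PER_LONG - _PP_DMA_INDEX_SHIFT - 2)

With the shift of 7 from the sketch, that works out to 23 usable ID
bits on 32-bit instead of 24.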
> IMHO, even if some trick like the above is really feasible, it may be
> adding some limitation or complexity to the ARCH and MM subsystems;
> for example, stashing the ID in page->signature may cause the 0x*40
> signature to be unusable for other poison pointer purposes. It makes
> more sense to make this obvious by doing the above trick in some MM
> header file like poison.h instead of in the page_pool subsystem.

AFAIU, PP_SIGNATURE is used for page_pool to be able to distinguish
its own pages from those allocated elsewhere (cf the description
above). Which means that these definitions are logically
page_pool-internal, and thus it makes the most sense to keep them in
the page_pool headers. The only bits the mm subsystem cares about in
that field are the bottom two (for pfmemalloc pages and compound
pages).

>>>> Since all the tracking is performed on DMA map/unmap, no
>>>> additional code is needed in the fast path, meaning the performance
>>>> overhead of this tracking is negligible. A micro-benchmark shows
>>>> that the total overhead of using xarray for this purpose is about
>>>> 400 ns (39 cycles(tsc) 395.218 ns; sum for both map and unmap[1]).
>>>> Since this cost is only paid on DMA map and unmap, it seems like an
>>>> acceptable cost to fix the late unmap
>>>
>>> For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the
>>> DMA map and unmap operations are almost negligible as said below, so
>>> the cost is about 200% performance degradation, which doesn't seem
>>> like an acceptable cost.
>>
>> I disagree. This only impacts the slow path; as long as pages are
>> recycled there is no additional cost. While your patch series has
>> demonstrated that it is *possible* to reduce the cost even in the
>> slow path, I don't think the complexity cost of this is worth it.
>
> It is still the datapath, otherwise there wouldn't be specific testing
> for that use case; more than 200% performance degradation is too much
> IMHO.

Do you have a real-world use case (i.e., a networking benchmark, not a
micro-benchmark of the allocation function) where this change has a
measurable impact on performance? If so, please do share the numbers!
I very much doubt it will be anything close to that magnitude, but I'm
always willing to be persuaded by data :)

> Leaving aside the above severe performance degradation, reusing space
> in page->signature seems to be a tradeoff between adding complexity in
> the page_pool subsystem and in the VM/ARCH subsystems, as mentioned
> above.

I think you are overstating the impact on other MM users; this is all
mostly page_pool-internal logic, cf the explanation above.

>>
>> [...]
>>
>>>> The extra memory needed to track the pages is neatly encapsulated
>>>> inside xarray, which uses the 'struct xa_node' structure to track
>>>> items. This structure is 576 bytes long, with slots for 64 items,
>>>> meaning that a full node incurs only 9 bytes of overhead per slot
>>>> it tracks (in practice, it probably won't be this efficient, but in
>>>> any case it should
>>>
>>> Is there any debug infrastructure to know whether it is actually
>>> this efficient, as there may be 576 bytes of overhead for a single
>>> page in the worst case?
>>
>> There's an XA_DEBUG define which enables some dump functions, but I
>> don't think there's any API to inspect the memory usage. I guess you
>> could attach a BPF program and walk the structure, or something?
>
> It seems XA_DEBUG is not defined in production environments, and I am
> not familiar enough with BPF to know whether the BPF approach is
> feasible in a production environment. Any example of the above BPF
> program and how to attach it?

Hmm, no, not really, sorry :( I *think* it should be possible to write
a bpftrace script that walks the internal xarray tree and counts the
nodes along the way, but it's not trivial to do, and I haven't tried.

-Toke
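
P.S. If counting the tracked entries (rather than the xa_node
structures themselves) is enough, that part at least is simple to do
from the kernel side. A completely untested sketch, assuming the
dma_mapped xarray this series adds to struct page_pool (modulo whatever
the field ends up being called):

  /* Untested sketch: count entries in the pool's ID xarray. A full
   * xa_node holds XA_CHUNK_SIZE (64 by default) slots, so at least
   * DIV_ROUND_UP(count, 64) nodes are in use; sparse ID allocation
   * can make the real node count higher.
   */
  static unsigned long pp_dma_mapped_count(struct page_pool *pool)
  {
  	unsigned long index, count = 0;
  	void *entry;

  	xa_for_each(&pool->dma_mapped, index, entry)
  		count++;

  	return count;
  }

That only gives a lower bound on the node overhead, not the exact
number, but it might be enough to spot pathological cases.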