From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6CB37C282EC for ; Mon, 17 Mar 2025 15:16:57 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 812D7280002; Mon, 17 Mar 2025 11:16:56 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 79E02280001; Mon, 17 Mar 2025 11:16:56 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 63C67280002; Mon, 17 Mar 2025 11:16:56 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17]) by kanga.kvack.org (Postfix) with ESMTP id E2051280001 for ; Mon, 17 Mar 2025 11:16:55 -0400 (EDT) Received: from smtpin30.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id E9FB880194 for ; Mon, 17 Mar 2025 15:16:54 +0000 (UTC) X-FDA: 83231395548.30.7ADFA47 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf07.hostedemail.com (Postfix) with ESMTP id 4CB814001D for ; Mon, 17 Mar 2025 15:16:50 +0000 (UTC) Authentication-Results: imf07.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=aGt51hAb; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf07.hostedemail.com: domain of toke@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=toke@redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1742224612; a=rsa-sha256; cv=none; b=jS1YwTNdbGb86e3rWfozoBfbgYZQZTHq98wjUK7L1wrLJ2bqF9aOt8g9cwkRY3ThLmc0nK xV1FGqwZKTwxLYB8kwgx05Q0kekbwjZsiU4KfXWwo2s4m55kodSgKJvFPDM9YjYI+4Z5xa Fk1AatuOX2sG5G+ZaRGhAgRlZ2ilMJE= ARC-Authentication-Results: i=1; imf07.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=aGt51hAb; dmarc=pass (policy=quarantine) header.from=redhat.com; spf=pass (imf07.hostedemail.com: domain of toke@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=toke@redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1742224612; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=fGtzTo1LAc8iiHgf/rdSrqAOpGbI/PBDwP4kOYIY5Uo=; b=OX7RkK8+CBgKTdHS+NGMmDTo+1YhhXAJ2Fx2M+KrR/ak7dzgQaiT2cPSx/YrYE7kZC9vYg +E/Lv+caqDdNXbO86Yi8HCLQvEKgpLiAdyVzxv8yWZned9IngL6KO/DtBNS3go8FmxYLhc 6EqOBjWnbbjiSEvxXMCTzjuHXEkD0b0= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1742224608; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=fGtzTo1LAc8iiHgf/rdSrqAOpGbI/PBDwP4kOYIY5Uo=; b=aGt51hAbXpLRwAxOGRZ0z0/sUVUC91h1txoYMtrCeWQ1vcmnPVAttmS+2NMpH2+04GFLwk 6ZfAvNMH11bccCE3u2QICu4bu8tcXnIJhXwYSCavUgN9fiB5Wc1JOszUZbPDalvuTNJP6m viBysC/MVQ3LBQxybzRcEGoqKJIoUf4= Received: from mail-lj1-f200.google.com (mail-lj1-f200.google.com [209.85.208.200]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-210-ncHM3aR1PG-ifmCmHlz6Mg-1; Mon, 17 Mar 2025 11:16:47 -0400 X-MC-Unique: ncHM3aR1PG-ifmCmHlz6Mg-1 X-Mimecast-MFC-AGG-ID: ncHM3aR1PG-ifmCmHlz6Mg_1742224606 Received: by mail-lj1-f200.google.com with SMTP id 38308e7fff4ca-30c4cbc324bso16312251fa.1 for ; Mon, 17 Mar 2025 08:16:47 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1742224606; x=1742829406; h=content-transfer-encoding:mime-version:message-id:date:references :in-reply-to:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=CzRZI32PvG1jM7oaHs4LUm1pa2tu5/yx3DzI4q+HHBg=; b=UDisz8peq3zykWrQjHZbhidF6vyFXD5EMOGV1pgUebfBTQej3cdJqLDzuqEvYzU+kv NFcoj71xzKm7w8l92N7NQXlU02Q29Rm7ojgQE3shSQ1bNMPlI75GUF5jYY/4COXVxkWj yO2OUevFXUWaKtR1NbqD5AP24wkoFm5gTcPTu82+ceOeMlNtFXcFOsEe4U2JGgHEnjRN qgik+BmaKSjT3h5MQrouv9DflxNQt9lIKeY+0JwEQ2a+roFjQ8mqVrBJgDtgxxvGhqqO obnRNFhlwGOPCK2L0y0hp+JV3D39lrBksjgB9UswUE0gY2yi1YP4VM6w6on7QD0sa7T+ 6n5w== X-Forwarded-Encrypted: i=1; AJvYcCWcbLIM8CAUwq3KXMTZioJhPqJZY+z20waPAhp+bsOT1+HIfMrQvx9P21aRYi7Ch2FRBwk333i/jg==@kvack.org X-Gm-Message-State: AOJu0YxdmmRv1U4GtUDBQiD1/LJCLrkDExYL+j66SN5sS4DCiUWD2Qod sJ3RZwPIZBz+UGKgWH2xV9vpQkH7wHHjSlfwoZQFaQh3HkjpEDN+pGjzuY+dE3n9Lk5nXdAQwJj cNWfMiqyOaqqYKcST66TlznQJOaMKNSoOkpwwJp+OMSqkYKmx X-Gm-Gg: ASbGncuNW5cdGoq5wY5jAKY75Vb8/W1Wah0adumlMNZYEcshOzsIowEmQLbGhKK3iTV P8nW5Gvop4HcW6j56wwzTMOJYqYN7XKVXNqiS1d3ElrMHvh9qyA180uSPnESiGuMnjWsPuF6uN9 C6ZYR7MOOHzWFz6izxoE2w88NB1SNEBjmMBoJee48udTUODU30yd2sKMZ7Rpvg6XbCRn8rnLR0v jw0fIelbEBfWDFPU9V7KNQtwyK+4jx86xP1IX8F8qiFylMHXAZkIXJOqQKdCDaqGfisC5LEPT4K rv37NvC9a4fq X-Received: by 2002:a2e:740a:0:b0:30b:eb08:53e3 with SMTP id 38308e7fff4ca-30c4a8743d9mr64207491fa.17.1742224605617; Mon, 17 Mar 2025 08:16:45 -0700 (PDT) X-Google-Smtp-Source: AGHT+IFSi8cQyOlG/W3LR4YI+3jPWVFa3/mQXBuzJDNH22ZWPZDVglcHs63ffS5X17IwRGbk4plwDg== X-Received: by 2002:a2e:740a:0:b0:30b:eb08:53e3 with SMTP id 38308e7fff4ca-30c4a8743d9mr64207081fa.17.1742224605112; Mon, 17 Mar 2025 08:16:45 -0700 (PDT) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id 38308e7fff4ca-30c3f116a4asm15789761fa.58.2025.03.17.08.16.43 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 17 Mar 2025 08:16:44 -0700 (PDT) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id DBB6318FAED8; Mon, 17 Mar 2025 16:16:42 +0100 (CET) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: Yunsheng Lin , "David S. Miller" , Jakub Kicinski , Jesper Dangaard Brouer , Saeed Mahameed , Leon Romanovsky , Tariq Toukan , Andrew Lunn , Eric Dumazet , Paolo Abeni , Ilias Apalodimas , Simon Horman , Andrew Morton , Mina Almasry , Yonglong Liu , Yunsheng Lin , Pavel Begunkov , Matthew Wilcox , Robin Murphy , IOMMU Cc: netdev@vger.kernel.org, bpf@vger.kernel.org, linux-rdma@vger.kernel.org, linux-mm@kvack.org, Qiuling Ren , Yuying Ma Subject: Re: [PATCH net-next 3/3] page_pool: Track DMA-mapped pages and unmap them when destroying the pool In-Reply-To: References: <20250314-page-pool-track-dma-v1-0-c212e57a74c2@redhat.com> <20250314-page-pool-track-dma-v1-3-c212e57a74c2@redhat.com> X-Clacks-Overhead: GNU Terry Pratchett Date: Mon, 17 Mar 2025 16:16:42 +0100 Message-ID: <87jz8nhelh.fsf@toke.dk> MIME-Version: 1.0 X-Mimecast-Spam-Score: 0 X-Mimecast-MFC-PROC-ID: TJ6Vv7q8Farn_iOEQpHUW3Mgjov1XiikHoWy-BC0KMM_1742224606 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable X-Stat-Signature: 5ctr114scd8npccfmpgk1rdojkjnxpyu X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: 4CB814001D X-Rspam-User: X-HE-Tag: 1742224610-336884 X-HE-Meta: U2FsdGVkX1/JUSwOTid50xR4KS77KEEAquevRWPv8CMzUdxgCSSsa1Pw9Lbp8dAUvbgGdaoGmo8/tTxzgP/oiM8+45iLjIhY/JVwqj2MCN1D295j31fdglnLQmrDoIPtpJm4HajGwJ5nBK9FjU0qnGXWMv2aCsJUjt8Lh9V0RWtQ9HNYRjO5DiuISgoRS3jMT+5cLYYs3JDS0Ez15AfhZTngziNSa0D2LynOjjLQ6k1WyNkCmnlk1rkhu7MkRCm0NHCNgnBUm6dg3tbx0UEggiVm6yjoufb+liE+wSoSKjA6YfD+rbv8AmMTSdiahKLFuOO/fcE5gVrcdv/zDOdcfkoZ4UmJILbSFVZJ9YFblC9kHgRLqH3BtM4v4tgYOb4V5qpYEHwysyHP8J18zeKT60OL/S0tTOuWi9Q8PPY5XsP6z++n5nIJ4ZNcQm84UMsct2ZblxWvHjthhIAaINaP0EU3ADjfRL80AcER0sVLc9Z1VwBHmDNT+t/hgh+uZAEboAIUlQm8SJGPV2Zk7YIDe2lVu9P/Bs1L4WAay8RY41n5BiK70TkbA4km7PesoUuPSt/8zC8uMxW0qNqAk33WysqlbRev9oWPMMiNKx/5s81mEmorYBxwllYaZQmaX581uxUJyYr19dTICb/y+hWMSacMLFJP9dYW+sMD+fwUIJzmxUVXgyt5dUMmgL/mKolRSzpo3sojf0o5HAr7N/qS9PaWbbTB4EToJZ4NMgNzJv4Hswi8hOoqcLneVf/D343jN9x9VdwEKNqEfpITJ82JyBM4BO8KAq3B6YJZN3IzTdaeK8SPa8p0xpcFjrS8jTk+Ie0D+vdmasJdQcNhlgqNWhs+g1WMiVDj3MgHQlX7WY/GdMzqC7BBSnKaxK1wFPZUa8nKuTtuBkilEgWUbn59U881rZrphge5Y5ZJ161voWSZ2LGxwpuld9G6NwtqK5DOpAOZQDBCEPoNlRqy+g8 ybQZJ3aa iEwsrBfkvGfz1/N4C9nrQtHJS2iAI6EcyLSGjn6yvvi3eTji2fs2GL0pUzvIB54RkWzru9zPa/0Lb/4dEDzbyC+31u2Dv+wCvnrilaDaxXfCe78laNrQRZpffPj/ZqzOatdvZlHSIOF9plyxSM0j5HuqQq4MlzDP+BwBSNdVuaswXBPMr752LLS2qMtvoB8T4nsNKvhTfOXcHgMY+Uz5ASCTqoCka43w8IVSE4yOkyq2kNTSUA6IXhnNJvdZzvo4V2kOrggZD4DhTMgjfJpnVFq9QTD6KJ1oTPfUhfEwcDZjq4aXQPT3Wq7f/1cZ7OSbtp+u+1pjEDanHYfvX7IPVwTTK81cjLGRy5Of5TVngSdSCvnJSW19HCYSGsYv41o8S0BuEFNj6+Q91hUuhO+GCoxvpKRkyPzZgRqSt1safvn33ENlRKx1ykR5Vb2BdHo3DHRoeueJhWrI75H9FbjjS30WD6hB6pvBcKZJ39gB8zZ8cusOf8ULAduWFgEHZi/Os3GLa7gouqvPQEuu5AquADjp54Kb7ih8u162+isKjEKAawph4843dRBwdgxE1UBnLk00jNIhxNm5cOr6DZiidfWXPzw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: List-Subscribe: List-Unsubscribe: Yunsheng Lin writes: > On 3/14/2025 6:10 PM, Toke H=C3=B8iland-J=C3=B8rgensen wrote: > > ... > >>=20 >> To avoid having to walk the entire xarray on unmap to find the page >> reference, we stash the ID assigned by xa_alloc() into the page >> structure itself, using the upper bits of the pp_magic field. This >> requires a couple of defines to avoid conflicting with the >> POINTER_POISON_DELTA define, but this is all evaluated at compile-time, >> so does not affect run-time performance. The bitmap calculations in this >> patch gives the following number of bits for different architectures: >>=20 >> - 24 bits on 32-bit architectures >> - 21 bits on PPC64 (because of the definition of ILLEGAL_POINTER_VALUE) >> - 32 bits on other 64-bit architectures > > From commit c07aea3ef4d4 ("mm: add a signature in struct page"): > "The page->signature field is aliased to page->lru.next and > page->compound_head, but it can't be set by mistake because the > signature value is a bad pointer, and can't trigger a false positive > in PageTail() because the last bit is 0." > > And commit 8a5e5e02fc83 ("include/linux/poison.h: fix LIST_POISON{1,2}=20 > offset"): > "Poison pointer values should be small enough to find a room in > non-mmap'able/hardly-mmap'able space." > > So the question seems to be: > 1. Is stashing the ID causing page->pp_magic to be in the mmap'able/ > easier-mmap'able space? If yes, how can we make sure this will not > cause any security problem? > 2. Is the masking the page->pp_magic causing a valid pionter for > page->lru.next or page->compound_head to be treated as a vaild > PP_SIGNATURE? which might cause page_pool to recycle a page not > allocated via page_pool. Right, so my reasoning for why the defines in this patch works for this is as follows: in both cases we need to make sure that the ID stashed in that field never looks like a valid kernel pointer. For 64-bit arches (where CONFIG_ILLEGAL_POINTER_VALUE), we make sure of this by never writing to any bits that overlap with the illegal value (so that the PP_SIGNATURE written to the field keeps it as an illegal pointer value). For 32-bit arches, we make sure of this by making sure the top-most bit is always 0 (the -1 in the define for _PP_DMA_INDEX_BITS) in the patch, which puts it outside the range used for kernel pointers (AFAICT). >> Since all the tracking is performed on DMA map/unmap, no additional code >> is needed in the fast path, meaning the performance overhead of this >> tracking is negligible. A micro-benchmark shows that the total overhead >> of using xarray for this purpose is about 400 ns (39 cycles(tsc) 395.218 >> ns; sum for both map and unmap[1]). Since this cost is only paid on DMA >> map and unmap, it seems like an acceptable cost to fix the late unmap > > For most use cases when PP_FLAG_DMA_MAP is set and IOMMU is off, the > DMA map and unmap operation is almost negligible as said below, so the > cost is about 200% performance degradation, which doesn't seems like an > acceptable cost. I disagree. This only impacts the slow path, as long as pages are recycled there is no additional cost. While your patch series has demonstrated that it is *possible* to reduce the cost even in the slow path, I don't think the complexity cost of this is worth it. [...] >> The extra memory needed to track the pages is neatly encapsulated inside >> xarray, which uses the 'struct xa_node' structure to track items. This >> structure is 576 bytes long, with slots for 64 items, meaning that a >> full node occurs only 9 bytes of overhead per slot it tracks (in >> practice, it probably won't be this efficient, but in any case it should > > Is there any debug infrastructure to know if it is not this efficient? > as there may be 576 byte overhead for a page for the worst case. There's an XA_DEBUG define which enables some dump functions, but I don't think there's any API to inspect the memory usage. I guess you could attach a BPF program and walk the structure, or something? >> +=09=09=09/* Make sure all concurrent returns that may see the old >> +=09=09=09 * value of dma_sync (and thus perform a sync) have >> +=09=09=09 * finished before doing the unmapping below. Skip the >> +=09=09=09 * wait if the device doesn't actually need syncing, or >> +=09=09=09 * if there are no outstanding mapped pages. >> +=09=09=09 */ >> +=09=09=09if (dma_dev_need_sync(pool->p.dev) && >> +=09=09=09 !xa_empty(&pool->dma_mapped)) >> +=09=09=09=09synchronize_net(); > > I guess the above synchronize_net() is assuming that the above dma sync > API is always called in the softirq context, as it seems there is no > rcu read lock added in this patch to be paired with that. Yup, that was my assumption. > Doesn't page_pool_put_page() might be called in non-softirq context when > allow_direct is false and in_softirq() returns false? I am not sure if this happens in practice in any of the delayed return paths we are worried about for this patch. If it does we could apply something like the diff below (on top of this patch). I can respin with this if needed, but I'll wait a bit and give others a chance to chime in. -Toke @@ -465,9 +465,13 @@ page_pool_dma_sync_for_device(const struct page_pool *= pool, =09=09=09 netmem_ref netmem, =09=09=09 u32 dma_sync_size) { -=09if ((READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) && -=09 dma_dev_need_sync(pool->p.dev)) -=09=09__page_pool_dma_sync_for_device(pool, netmem, dma_sync_size); +=09if (dma_dev_need_sync(pool->p.dev)) { +=09=09rcu_read_lock(); +=09=09if (READ_ONCE(pool->dma_sync) & PP_DMA_SYNC_DEV) +=09=09=09__page_pool_dma_sync_for_device(pool, netmem, +=09=09=09=09=09=09=09dma_sync_size); +=09=09rcu_read_unlock(); +=09} } =20 static bool page_pool_dma_map(struct page_pool *pool, netmem_ref netmem, g= fp_t gfp)