Date: Wed, 07 May 2025 13:47:59 -0400
From: Willem de Bruijn
To: David Howells, Jakub Kicinski
Cc: dhowells@redhat.com, Andrew Lunn, Eric Dumazet, "David S. Miller",
    David Hildenbrand, John Hubbard, Christoph Hellwig, willy@infradead.org,
    netdev@vger.kernel.org, linux-mm@kvack.org, Willem de Bruijn
Message-ID: <681b9ccf970c5_1f6aad29428@willemb.c.googlers.com.notmuch>
In-Reply-To: <1352674.1746625556@warthog.procyon.org.uk>
Subject: Re: Reorganising how the networking layer handles memory

David Howells wrote:
> Jakub Kicinski wrote:
>
> > On Tue, 06 May 2025 14:50:49 +0100 David Howells wrote:
> > > Jakub Kicinski wrote:
> > > > > (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because
> > > > >     it doesn't use page pinning.  It needs to use the GUP routines.
> > > >
> > > > We end up calling iov_iter_get_pages2(). Is it not setting
> > > > FOLL_PIN is a conscious choice, or nobody cared until now?
> > >
> > > iov_iter_get_pages*() predates GUP, I think.  There's now an
> > > iov_iter_extract_pages() that does the pinning stuff, but you have to do a
> > > different cleanup, which is why I created a new API call.
> > >
> > > iov_iter_extract_pages() also does no pinning at all on pages extracted from a
> > > non-user iterator (e.g. ITER_BVEC).
> >
> > FWIW it occurred to me after hitting send that we may not care.
> > We're talking about Tx, so the user pages are read only for the kernel.
> > I don't think we have the "child gets the read data" problem?
>
> Worse: if the child alters the data in the buffer to be transmitted after the
> fork() (say it calls free() and malloc()), it can do so; if the parent tries
> that, there will be no effect.
>
> > Likely all this will work well for ZC but not sure if we can "convert"
> > the stack to phyaddr+len.
>
> Me neither.
> We also use bio_vec[] to hold lists of memory and then trawl them
> to do cleanup, but a conversion to holding {phys,len} will mandate being able
> to do some sort of reverse lookup.
>
> > Okay, just keep in mind that we are working on 800Gbps NIC support these
> > days, and MTU does not grow. So whatever we do - it must be fast fast.
>
> Crazy:-)
>
> One thing I've noticed in the uring stuff is that it doesn't seem to like the
> idea of having an sk_buff pointing to more than one ubuf_info, presumably
> because the sk_buff will point to the ubuf_info holding the zerocopyable data.
> Is that actually necessary for SOCK_STREAM, though?

In MSG_ZEROCOPY this limitation of at most one ubuf_info per skb was
chosen just because it was simpler and sufficient.

A single skb can still combine skb frags from multiple consecutive
sendmsg requests, including multiple separate MSG_ZEROCOPY calls,
because the ubuf_info notification is for a range of bytes.

There is a rare edge case in skb_zerocopy_iter_stream that detects
two ubuf_infos on a single skb.

        /* An skb can only point to one uarg. This edge case happens
         * when TCP appends to an skb, but zerocopy_realloc triggered
         * a new alloc.
         */
        if (orig_uarg && uarg != orig_uarg)
                return -EEXIST;

Instead TCP then just creates a new skb. This will result in smaller
skbs than otherwise. But, as said, this is rare.

> My thought for SOCK_STREAM is to have an ordered list of zerocopy source
> records on the socket and a completion counter and not tag the skbuffs at all.
> That way, an skbuff can carry data for multiple zerocopy send requests.
>
> David
>
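
For illustration only, the "ordered list of zerocopy source records plus a
completion counter" idea quoted above could be modelled in plain user-space C
roughly as below. This is a sketch under stated assumptions, not kernel code:
struct zc_record, struct zc_sock, zc_send() and zc_complete() are hypothetical
names invented here, the ring size is arbitrary, and the model assumes the
stack releases stream bytes strictly in order, which a real implementation
would have to establish.

```c
#include <assert.h>
#include <stdint.h>

#define ZC_MAX 16	/* arbitrary ring size for this sketch */

/* Hypothetical record: one per zerocopy send request, kept in send order. */
struct zc_record {
	uint32_t id;		/* request number reported on completion */
	uint64_t end_seq;	/* stream offset just past this request's data */
};

/* Hypothetical per-socket state; not a real kernel structure. */
struct zc_sock {
	struct zc_record q[ZC_MAX];	/* FIFO ring of pending records */
	unsigned int head, count;
	uint64_t queued;	/* bytes handed to the socket so far */
	uint64_t completed;	/* completion counter: bytes released so far */
	uint32_t next_id;
};

/* Queue one zerocopy request of len bytes; returns its id, or -1 if full. */
static int zc_send(struct zc_sock *s, uint64_t len)
{
	struct zc_record *r;

	if (s->count == ZC_MAX)
		return -1;
	s->queued += len;
	r = &s->q[(s->head + s->count++) % ZC_MAX];
	r->id = s->next_id++;
	r->end_seq = s->queued;
	return (int)r->id;
}

/*
 * Advance the completion counter by bytes and retire every record whose
 * data is now fully released. The retired ids form the contiguous range
 * [*lo, *hi]; the return value is how many records were retired.
 */
static unsigned int zc_complete(struct zc_sock *s, uint64_t bytes,
				uint32_t *lo, uint32_t *hi)
{
	unsigned int n = 0;

	/* Assumption of this model: never release more than was queued. */
	assert(bytes <= s->queued - s->completed);
	s->completed += bytes;

	while (s->count && s->q[s->head].end_seq <= s->completed) {
		if (n == 0)
			*lo = s->q[s->head].id;
		*hi = s->q[s->head].id;
		s->head = (s->head + 1) % ZC_MAX;
		s->count--;
		n++;
	}
	return n;
}
```

In this model no buffer carries a tag: a release of N bytes, however those
bytes were packed into skbs, only moves the counter, and the records at the
head of the list are retired once the counter passes their end offset. The
retired ids always form a contiguous range, similar in spirit to the range
notification described above for MSG_ZEROCOPY.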