From: David Howells
To: Andrew Lunn
Cc: dhowells@redhat.com, Eric Dumazet, "David S. Miller", Jakub Kicinski, David Hildenbrand, John Hubbard, Christoph Hellwig, willy@infradead.org, netdev@vger.kernel.org, linux-mm@kvack.org
Subject: Reorganising how the networking layer handles memory
Date: Fri, 02 May 2025 17:21:48 +0100
Message-ID: <1069540.1746202908@warthog.procyon.org.uk>
In-Reply-To: <165f5d5b-34f2-40de-b0ec-8c1ca36babe8@lunn.ch>
References: <165f5d5b-34f2-40de-b0ec-8c1ca36babe8@lunn.ch> <0aa1b4a2-47b2-40a4-ae14-ce2dd457a1f7@lunn.ch> <1015189.1746187621@warthog.procyon.org.uk> <1021352.1746193306@warthog.procyon.org.uk>
Okay, perhaps I should start at the beginning :-).

There are a number of things that are going to mandate an overhaul of how the
networking layer handles memory:

 (1) The sk_buff code assumes it can take refs on pages it is given, but the
     page ref counter is going to go away in the relatively near term.
     Indeed, you're already not allowed to take a ref on, say, slab memory,
     because the page ref doesn't control the lifetime of the object.  Even
     pages are going to kind of go away.  Willy haz planz...
 (2) sendmsg(MSG_ZEROCOPY) suffers from the O_DIRECT vs fork() bug because it
     doesn't use page pinning.  It needs to use the GUP routines.

 (3) sendmsg(MSG_SPLICE_PAGES) isn't entirely satisfactory because it can't
     be used with certain memory types (e.g. slab).  It takes a ref on
     whatever it is given - which is wrong if it should pin this instead.

 (4) iov_iter extraction will probably change to dispensing {physaddr,len}
     tuples rather than {page,off,len} tuples.  The socket layer won't then
     see pages at all.

 (5) Memory segments splice()'d into a socket may have who-knows-what weird
     lifetime requirements.

So after discussions at LSF/MM, what I'm proposing is this:

 (1) If we want to use zerocopy, we (the kernel) have to pass a cleanup
     function to sendmsg() along with the data.  If you don't care about
     zerocopy, it will copy the data.

 (2) For each message sent with sendmsg(), the cleanup function is called
     progressively as parts of the data it included are completed.  I would
     do it progressively so that big messages can be handled.

 (3) We also pass an optional 'refill' function to sendmsg().  As data is
     sent, the code that extracts the data will call this to pin more user
     bufs (we don't necessarily want to pin everything up front).  The refill
     function is permitted to sleep to allow the amount of pinned memory to
     subside.

 (4) We move a lot of the zerocopy-wrangling code out of the basement of the
     networking code and put it at the system call level, above the call to
     ->sendmsg(), and the basement code then calls the appropriate functions
     to extract, refill and clean up.  It may be usable in other contexts too
     - DIO to regular files, for example.

 (5) The SO_EE_ORIGIN_ZEROCOPY completion notifications are then generated
     by the cleanup function.

 (6) The sk_buff struct does not retain *any* refs/pins on memory fragments
     it refers to.  This is done for it by the zerocopy layer.
This will allow us to kill three birds with one stone:

 (A) It will fix the issues with zerocopy transmission mentioned above (DIO
     vs fork, pin vs ref, pages without refcounts).  Whoever calls sendmsg()
     is then responsible for maintaining the lifetime of the memory by
     whatever means necessary.

 (B) Kernel drivers (e.g. network filesystems) can then use MSG_ZEROCOPY
     (MSG_SPLICE_PAGES can be discarded).  They can create their own message,
     cobbling it together out of kmalloc'd memory and arrays of pages, safe
     in the knowledge that the network stack will treat it only as an array
     of memory fragments.  They would supply their own cleanup function to do
     the appropriate folio putting and would not need a "refill" function.
     The extraction can be handled by standard iov_iter code.

     This would allow a network filesystem to transmit a complete RPC message
     with a single sendmsg() call, avoiding the need to cork the socket.

 (C) Make it easier to provide alternative userspace notification mechanisms
     than SO_EE_ORIGIN_ZEROCOPY.  Maybe by allowing a "cookie" to be passed
     in the control message that can be passed back by some other mechanism
     (e.g. recvmsg).  Or by providing a user address that can be altered and
     a futex triggered on it.

There's potentially a fourth bird too, but I'm not sure how practical it is:

 (D) What if TCP and UDP sockets, say, *only* do zerocopy, and the syscall
     layer does the buffering transparently to hide that from the user?
     That could massively simplify the lower layers and perhaps make the
     buffering more efficient.  For instance, the data could be organised by
     the top layer into (large) pages and then the protocol would divide
     that up.  Smaller chunks that need to go immediately could be placed in
     kmalloc'd buffers rather than using a page-frag allocator.

There are some downsides/difficulties too.
Firstly, it would probably render checksum-whilst-copying impossible (though
being able to use CPU copy acceleration might make up for that, as might
checksum offload).  Secondly, it would mean that sk_buffs would have at least
two fragments - header and data - which might be impossible for some NICs.
Thirdly, some protocols just want to copy the data into their own skbuffs
whatever.

There are also some issues with this proposal:

 (1) AF_ALG.  This does its own thing, including direct I/O without
     MSG_ZEROCOPY being set.  It doesn't actually use sk_buffs.  Really,
     it's not a network protocol in the normal sense and might have been
     better implemented as, say, a bunch of functions in io_uring.

 (2) Packet crypto.  Some protocols might want to do encryption from the
     source buffers into the skbuff, and this would amount to a duplicate
     copy.  This might be made more complicated by things like the TLS upper
     level protocol on TCP, where we might be able to offload the crypto to
     the NIC, but might have to do it ourselves.

 (3) Is it possible to have a mixture of zerocopy and non-zerocopy pieces in
     the same sk_buff?  If there's a mixture, it would be possible to deal
     with the non-zerocopy bit by allocating a zerocopy record and setting
     the cleanup function just to free it.

Implementation notes:

 (1) What I'm thinking is that there will be an 'iov_manager' struct that
     manages a single call to sendmsg().  This will be refcounted and carry
     the completion state (kind of like ubuf_info) and the cleanup function
     pointer.

 (2) The upper layer will wrap iov_manager in its own thing (kind of like
     ubuf_info_msgzc).

 (3) For sys_sendmsg(), sys_sendmmsg() and io_uring, use a 'syscall-level
     manager' that will progressively pin and unpin userspace buffers.

     (a) This will keep a list of the memory fragments it currently has
         pinned in a rolling buffer.  It has to be able to find them to
         unpin them, and it has to allow for the userspace addresses having
         been remapped or unmapped.
     (b) As its refill function gets called, the manager will pin more pages
         and add them to the producer end of the buffer.

     (c) These can then be extracted by the protocol into skbuffs.

     (d) As its cleanup function gets called, it will advance the consumer
         end and unpin/discard memory ranges that are consumed.

     I'm not sure how much drag this might add to performance, though, so it
     will need to be tried and benchmarked.

 (4) Possibly, the list of fragments can be made directly available through
     an iov_iter type and a subset attached directly to a sk_buff.

 (5) SOCK_STREAM sockets will keep an ordered list of manager structs, each
     tagged with the byte transmission sequence range for that struct.  The
     socket will keep a transmission completion sequence counter and, as the
     counter advances through the manager list, their cleanup functions will
     be called and, ultimately, they'll be detached and put.

 (6) SOCK_DGRAM sockets will keep a list of manager structs on the sk_buff
     as well as on the socket.  The problem is that they may complete out of
     order, but SO_EE_ORIGIN_ZEROCOPY works by byte position.  Each time a
     sk_buff completes, all the managers attached to it are marked complete,
     but complete managers can only get cleaned up when they reach the front
     of the queue.

 (7) Kernel services will wrap iov_manager in their own wrapper and will
     pass down an iterator that describes their message in its entirety
     through an iov_iter.

Finally, this doesn't cover recvmsg() zerocopy, which might also have some
of the same issues.

David