From: David Howells
Organization: Red Hat UK Ltd. Registered Address: Red Hat UK Ltd, Amberley Place, 107-111 Peascod Street, Windsor, Berkshire, SI4 1TE, United Kingdom. Registered in England and Wales under Company Registration No. 3798903
In-Reply-To: <1069540.1746202908@warthog.procyon.org.uk>
References: <1069540.1746202908@warthog.procyon.org.uk>
 <165f5d5b-34f2-40de-b0ec-8c1ca36babe8@lunn.ch>
 <0aa1b4a2-47b2-40a4-ae14-ce2dd457a1f7@lunn.ch>
 <1015189.1746187621@warthog.procyon.org.uk>
 <1021352.1746193306@warthog.procyon.org.uk>
To: Andrew Lunn
Cc: dhowells@redhat.com, Eric Dumazet, "David S. Miller", Jakub Kicinski,
 David Hildenbrand, John Hubbard, Christoph Hellwig, willy@infradead.org,
 Christian Brauner, Al Viro, Miklos Szeredi, torvalds@linux-foundation.org,
 netdev@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-kernel@vger.kernel.org
Subject: AF_UNIX/zerocopy/pipe/vmsplice/splice vs FOLL_PIN
Date: Mon, 12 May 2025 15:51:30 +0100
Message-ID: <2135907.1747061490@warthog.procyon.org.uk>

I'm looking at how to make sendmsg() handle page pinning - and also
working towards supporting the page refcount eventually being removed and
only being available with certain memory types.

One of the outstanding issues is in sendmsg().  Analogously with DIO
writes, sendmsg() should be pinning memory (FOLL_PIN/GUP) rather than
simply getting refs on it before it attaches it to an sk_buff.

Without this, if memory is spliced into an AF_UNIX socket and then the
process forks, that memory gets attached to the child process, and the
child can alter the data, probably by accident, if the memory is on the
stack or in the heap.

Further, kernel services can use MSG_SPLICE_PAGES to attach memory
directly to an AF_UNIX socket (though I'm not sure if anyone actually does
this).

(For writing to TCP/UDP with MSG_ZEROCOPY, MSG_SPLICE_PAGES or vmsplice, I
think we're probably fine - assuming the loopback driver doesn't give the
receiver the transmitter's buffers to use directly...  This may be a big
'if'.)

Now, this probably wouldn't be a problem, but for the fact that one can
also splice this stuff back *out* of the socket.

The same issues exist for pipes too.

The question is what should happen here to a memory span for which the
network layer or pipe driver is not allowed to take a reference, but
rather must call a destructor?  Particularly if, say, it's just a small
part of a larger span.

It seems reasonable that we should allow pinned memory spans to be queued
in a socket or a pipe - that way, we only have to copy the data once in
the event that the data is extracted with read(), recvmsg() or similar.
But if it's spliced out we then have all the fun of managing the lifetime
- especially if it's a big transfer that gets split into bits.  In such a
case, I wonder if we can just duplicate the memory at splice-out rather
than trying to keep all the tracking intact.

If the memory was copied in, then moving the pages should be fine - though
the memory may not be of a ref'able type (which would be fun if bits of
such a page get spliced to different places).

I'm sure there is some app somewhere (fuse maybe?) where this would be a
performance problem, though.

And then there's vmsplice().  The same goes for vmsplice() to AF_UNIX or
to a pipe.  That should also pin memory.
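
To make the lifetime problem concrete, here is a minimal userspace sketch,
assuming Linux with vmsplice() available.  The helper name
vmsplice_then_modify() is made up for illustration and is not from any
existing code.  It queues one page of the caller's memory in a pipe,
overwrites the buffer while the page is still queued, and then reads the
pipe back.  Because the pipe holds the page itself rather than a copy, the
reader would be expected to see the later contents ('B') rather than what
was there at vmsplice() time ('A'), though I wouldn't want to promise that
for every kernel or sandbox.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Returns the first byte the pipe reader sees, or -1 on error.
 */
int vmsplice_then_modify(void)
{
	long psz = sysconf(_SC_PAGESIZE);
	struct iovec iov;
	int pfd[2];
	int ret = -1;
	char *buf;
	char out;

	assert(psz > 0);

	buf = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	if (pipe(pfd) < 0)
		goto out_unmap;

	/* Write first so that a real page (not the zero page) is mapped. */
	memset(buf, 'A', (size_t)psz);

	/* Queue the page in the pipe; no data is copied at this point. */
	iov.iov_base = buf;
	iov.iov_len = (size_t)psz;
	if (vmsplice(pfd[1], &iov, 1, 0) != (ssize_t)psz)
		goto out_close;

	/* Modify the buffer while the page is still sitting in the pipe. */
	memset(buf, 'B', (size_t)psz);

	/* Extract with read(); this is where the data finally gets copied. */
	if (read(pfd[0], &out, 1) == 1)
		ret = (unsigned char)out;

out_close:
	close(pfd[0]);
	close(pfd[1]);
out_unmap:
	munmap(buf, (size_t)psz);
	return ret;
}
```

Calling this from a small main() and printing the result is enough to see
which version of the data the reader got.  The fork case described earlier
is the nastier variant of the same thing, where it is the child rather than
the sender that ends up owning the queued page.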

It may also be possible to vmsplice a pinned page, or a page from a memory
span with some other type of destruction, into the target process's VM.

I don't suppose we can deprecate vmsplice()?

David