From: Mina Almasry <almasrymina@google.com>
Date: Fri, 30 May 2025 09:22:17 -0700
Subject: Re: Device mem changes vs pinning/zerocopy changes
To: David Howells
Cc: willy@infradead.org, hch@infradead.org, Jakub Kicinski, Eric Dumazet,
    netdev@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
In-Reply-To: <770012.1748618092@warthog.procyon.org.uk>

On Fri, May 30, 2025 at 8:15 AM David Howells wrote:
>
> Hi Mina,
>
> I've seen your transmission-side TCP devicemem stuff has just gone in
> and it conflicts somewhat with what I'm trying to do.  I think you're
> working on the problem bottom up and I'm working on it top down, so if
> you're willing to collaborate on it...?
>

Hi David! Yes, very happy to collaborate.

FWIW, my initial gut feeling is that the work doesn't conflict that
much. The TCP devmem netmem/net_iov stuff is designed to follow the
page stuff, and as the usage of struct page changes we're happy to move
net_iovs and netmems in the same direction. My read is that it will
take a small amount of extra work, but there are no in-principle design
conflicts, at least AFAICT so far.

> So, to summarise what we need to change (you may already know all of
> this):
>
> (*) The refcount in struct page is going to go away.  The sk_buff
>     fragment wrangling code, however, occasionally decides to override
>     the zerocopy mode and grab refs on the pages pointed to by those
>     fragments.  sk_buffs *really* want those page refs - and it does
>     simplify memory handling.  But.
>

I believe the main challenge here is that there are many code paths in
the net stack that expect to be able to grab a ref on skb frags. See
all the callers of skb_frag_ref/skb_frag_unref: tcp_grow_skb,
__skb_zcopy_downgrade_managed, __pskb_copy_fclone, pskb_expand_head,
skb_zerocopy, skb_split, pskb_carve_inside_header,
pskb_carve_inside_nonlinear, tcp_clone_payload, skb_segment.

I think to accomplish what you're describing we need to modify
skb_frag_ref to do something other than taking a reference on the page
or net_iov. Taking a reference on the skb itself may be acceptable, so
that the skb can 'guarantee' that the individual frags underneath it
don't disappear while these functions are executing. But devmem TCP
doesn't get much in the way here, AFAICT. It's really the fact that so
many networking code paths want to obtain page refs via skb_frag_ref,
so there is potentially a lot of code to touch.
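
Concretely, something like this is what I have in mind - a rough sketch
only, completely untested, and 'frag_lifetime' and friends are made-up
names, not anything that exists today:

        #include <linux/refcount.h>

        /* Hypothetical: one refcounted lifetime object shared by all
         * frags of an skb; skb_frag_ref()/skb_frag_unref() would bump
         * this instead of the page/net_iov refcount. */
        struct frag_lifetime {
                refcount_t ref;
                void (*destruct)(struct frag_lifetime *fl);
        };

        static inline void frag_lifetime_get(struct frag_lifetime *fl)
        {
                refcount_inc(&fl->ref);
        }

        static inline void frag_lifetime_put(struct frag_lifetime *fl)
        {
                /* Unpin/free the underlying memory exactly once, via
                 * whatever destructor the memory provider supplied. */
                if (refcount_dec_and_test(&fl->ref))
                        fl->destruct(fl);
        }

That way callers like skb_split or skb_segment keep a cheap "the frags
can't disappear under me" guarantee without ever touching a page
refcount.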
> Anyway, we need to stop taking refs where possible.  A fragment may in
> future point to a sequence of pages and we would only be getting a ref
> on one of them.
>
> (*) Further, the page struct is intended to be slimmed down to a
>     single typed pointer if possible, so all the metadata in the
>     net_iov struct will have to be separately allocated.
>

Yes, I'm already collaborating with Byungchul on this, and we're making
great progress. I think this may be close to getting fully reviewed:

https://lore.kernel.org/netdev/20250509115126.63190-1-byungchul@sk.com/

According to his cover letter, he's actually finding the work that I
did with netmem/net_iovs useful for him.

> (*) Currently, when performing MSG_ZEROCOPY, we just take refs on the
>     user pages specified by the iterator but we need to stop doing
>     that.  We need to call GUP to take a "pin" instead (and must not
>     take any refs).  The pages we get access to may be folio-type,
>     anon-type, some sort of device type.
>

This also doesn't conflict with devmem changes, AFAICT. Currently in
devmem we take references on the user devmem only because the page path
also takes references on the user pages (we mirror devmem and pages to
keep the netstack uniform without excessive mem-type checks). If the
code path for zerocopy_fill_skb_from_iter no longer needs to take refs
on the pages, we'll just migrate zerocopy_fill_skb_from_devmem to do
the same thing.
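
For reference, my understanding of the ref->pin switch you're
describing, in zerocopy_fill_skb_from_iter terms (sketch only,
untested; the helper names are made up, and whether tx wants gup_flags
like FOLL_LONGTERM is a policy question):

        #include <linux/mm.h>

        static long zc_pin_user_range(unsigned long uaddr,
                                      unsigned long nr_pages,
                                      struct page **pages)
        {
                /* FOLL_PIN is implied by pin_user_pages(); tx only
                 * reads the buffer, so no FOLL_WRITE.  This replaces
                 * the raw page refs we take today. */
                return pin_user_pages(uaddr, nr_pages, 0, pages);
        }

        static void zc_unpin_user_range(struct page **pages, long nr)
        {
                /* Pairs with the pin above; replaces put_page() in
                 * the completion-notification path. */
                unpin_user_pages(pages, nr);
        }

If that's the shape of it, the devmem side stays trivial, because
net_iovs never came from GUP in the first place.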
> (*) It would be good to do a batch lookup of user buffers to cut down
>     on the number of page table trawls we do - but, on the other hand,
>     that might generate more page faults upfront.
>
> (*) Splice and vmsplice.  If only I could uninvent them...  Anyway,
>     they give us buffers from a pipe - but the buffers come with
>     destructors and should not have refs taken on the pages we might
>     think they have, but use the destructor instead.
>

The above 2 points are orthogonal to devmem. There are no page table
walks, splice, or vmsplice with devmem. We just have to make sure that
the generic changes you're implementing also work with the devmem
paths. I can test your changes and point out any issues I see, but I
don't see a conflict in principle.

> (*) The intention is to change struct bio_vec to be just physical
>     address and length, with no page pointer.  You'd then use, say,
>     kmap_local_phys() or kmap_local_bvec() to access the contents from
>     the cpu.  We could then revert the fragment pointers to being
>     bio_vecs.
>

Ok, this part conflicts a bit, maybe. skb_frag_ref no longer uses
struct bio_vec. I merged that change before the 6.12 kernel; it's not a
very recent change:

https://lore.kernel.org/netdev/20240214223405.1972973-3-almasrymina@google.com/

But, AFAICT, skb_frag_t needs a struct page inside of it, not just a
physical address. skb_frags can be mmap'd into userspace for TCP
zerocopy, see tcp_zerocopy_vm_insert_batch (which is a very old
feature, not a recent change). There may be other call paths in the net
stack that require a full page, where just a physical address won't do
(unless somehow we can mmap a physical address into userspace).

> (*) Kernel services, such as network filesystems, can't pass
>     kmalloc()'d data to sendmsg(MSG_SPLICE_PAGES) because slabs don't
>     have refcounts and, in any case, the object lifetime is not
>     managed by refcount.  However, if we had a destructor, this
>     restriction could go away.
>

This also sounds orthogonal to devmem.

> So what I'd like to do is:
>
> (1) Separate fragment lifetime management from sk_buff.  No more
>     wangling of refcounts in the skbuff code.  If you clone an skb,
>     you stick an extra ref on the lifetime management struct, not the
>     page.
>

Agreed, and AFAICT devmem doesn't get in the way. Let me know if you
disagree or are seeing something different.

> (2) Create a chainable 'network buffer' struct, e.g.:
>
>         enum net_txbuf_type {
>                 NET_TXBUF_BUFFERED,     /* Buffered copy of data */
>                 NET_TXBUF_ZCOPY_USER,   /* Zerocopy of user buffers */
>                 NET_TXBUF_ZCOPY_KERNEL, /* Zerocopy of kernel buffers */
>         };
>
>         struct net_txbuf {
>                 struct net_txbuf        *next;
>                 struct mmpin            mm_pin;
>                 unsigned int            start_pos;
>                 unsigned int            end_pos;
>                 unsigned int            extracted_to;
>                 refcount_t              ref;
>                 enum net_txbuf_type     type;
>                 u8                      nr_used;
>                 bool                    wmem_charged;
>                 bool                    got_copied;
>                 union {
>                         /* For NET_TXBUF_BUFFERED: */
>                         struct {
>                                 void    *bufs[16];
>                                 u8      bufs_orders[16];
>                                 bool    last_buf_freeable;
>                         };
>                         /* For NET_TXBUF_ZCOPY_*: */
>                         struct {
>                                 struct sock             *sk;
>                                 struct sk_buff          *notify;
>                                 msg_completion_t        completion;
>                                 void                    *completion_data;
>                                 struct bio_vec          frags[12];
>                         };
>                 };
>         };
>
>     (Note this is very much still a WiP and very much subject to
>     change)
>
>     So how I envision it working depends on the type of flow in the
>     socket.  For the transmission side of streaming sockets (e.g.
>     TCP), the socket maintains a single chain of these.  Each txbuf is
>     of a single type, but multiple types can be interleaved.
>
>     For non-ZC flow, as data is imported, it's copied into pages
>     attached to the current head txbuf of type BUFFERED, with more
>     pages being attached as we progress.  Successive writes just keep
>     adding to the space in the latest page added and each skbuff
>     generated pins the txbuf it starts at and each txbuf pins its
>     successor.
>
>     As skbuffs are consumed, they unpin the root txbuf.  However, this
>     could leave an awful lot of memory pinned for a long time, so I
>     would mitigate this in two ways: firstly, where possible, keep
>     track of the transmitted byte position and progressively destruct
>     the txbuf; secondly, if we completely use up a partially filled
>     txbuf then reset the queue.
>
>     An skbuff's frag list then has a bio_vec[] that refers to
>     fragments of the buffers recorded in the txbuf chain.  An skbuff
>     may span multiple txbufs and a txbuf may provision multiple
>     skbuffs.
>
>     For the transmission side of datagram sockets (e.g. UDP) where the
>     messages may complete out of order, I think I would give each
>     datagram its own series of txbufs, but link the tails together to
>     manage the SO_EE_ORIGIN_ZEROCOPY notification generation if
>     dealing with userspace.  If dealing with the kernel, there's no
>     need to link them together as the kernel can provide a destructor
>     for each datagram.
>

To be honest, I didn't follow the entirety of point #2 here, but the
problem is not related to devmem. Is struct net_txbuf intended to
replace struct sk_buff in the tx path only? If so, I'm not sure that
works. Currently TX and RX memory share a single data structure
(sk_buff), and I believe that is critical, because RX skbs can be
forwarded to TX and vice versa. You will need to implement a
net_txbuf_to_skb helper and vice versa if you go this route, no?
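
Something like these - hypothetical signatures, purely to make the
point concrete:

        /* Every boundary where a packet can cross between tx and rx
         * (loopback, veth, IP forwarding, tc redirect, ...) would need
         * conversions along these lines: */
        struct sk_buff *net_txbuf_to_skb(struct net_txbuf *txb, gfp_t gfp);
        struct net_txbuf *skb_to_net_txbuf(struct sk_buff *skb, gfp_t gfp);

And every one of those conversion points is a place to get the lifetime
rules subtly wrong.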
So I think, maybe, instead of introducing a new struct, you have to
make the modifications you envision to struct sk_buff itself?

> (3) When doing zerocopy from userspace, do calls to GUP to get
>     batches of non-contiguous pages into a bio_vec array.
>
> (4) Because AF_UNIX and the loopback driver transfer packets from the
>     transmission queue of one socket down into the reception queue of
>     another, the use of txbufs would also need to extend onto the
>     receive side (and so "txbufs" would be a misnomer).
>

OK, so you realize that TX packets can be forwarded to RX. The opposite
is also true: RX can be forwarded to TX. And it's not just AF_UNIX and
loopback. Packets can be forwarded via IP forwarding, and tc, and
probably another half dozen features in the net stack I don't know
about. I think you need to modify the existing sk_buff. I think adding
a new struct and migrating the entire net stack to use it is a bit too
ambitious. But up to you. Just my 2 cents here.

>     When receiving a packet, a txbuf would need to be allocated and
>     the received buffers attached to it.  The pages wouldn't
>     necessarily need refcounts as the txbuf holds them.  The skbuff
>     holds a ref on the txbuf.
>
> (5) Cloning an skbuff would involve just taking an extra ref on the
>     first txbuf.  Splitting off part of an skbuff would involve
>     fast-forwarding the txbuf chain for the second part and pinning
>     that.
>
> (6) I have a chained-bio_vec array concept with an iov_iter type for
>     it that might make it easier to string together the fragments in a
>     reassembled packet and represent it as an iov_iter, thereby
>     allowing us to use common iterator routines for things like ICMP
>     and packet crypto.
>
> (7) We need to separate net_iov from struct page, and it might make
>     things easier if we do that now, allocating net_iov from a slab.
>

net_iov is already separate from struct page. The only commonality they
share is that we static_assert that the offsets of some page pool
fields are the same in struct page and struct net_iov, but that's it.
Byungchul is already handling the entire netmem_desc/shrink-struct-page
work in the series I linked to earlier, and I would say his work is
going well.

> (8) Reference the txbuf in a splice and provide a destructor that
>     drops that reference.  For small splices, I'd be very tempted to
>     simply copy the data.  For splice-out of data that was spliced
>     into an AF_UNIX socket or zerocopy data that passed through a
>     loopback device, I'm also very tempted to make splice copy at that
>     point.  There's a potential DoS attack whereby someone can
>     endlessly splice tiny bits of a message or just sit on them,
>     preventing the original provider from recovering its memory.
>
> (9) Make it easy for a network filesystem to create an entire compound
>     message and present it to the socket in a single sendmsg() with a
>     destructor.
>
> I've pushed my current changes (very incomplete as they are) to:
>
>     https://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs.git/log/?h=iov-experimental
>
> I'm writing functions to abstract out the loading of data into the
> txbuf chain and attaching it to skbuffs.  These can be found in
> skbuff.c as net_txbuf_*().  I've modified the TCP sendmsg code to use
> them.

-- 
Thanks,
Mina