From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sun, 02 Apr 2023 11:10:49 -0400
From: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
To: David Howells, Matthew Wilcox, "David S. Miller", Eric Dumazet,
	Jakub Kicinski, Paolo Abeni
Cc: David Howells, Al Viro, Christoph Hellwig, Jens Axboe, Jeff Layton,
	Christian Brauner, Chuck Lever III, Linus Torvalds,
	netdev@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Willem de Bruijn
Message-ID: <64299af9e8861_2d2a20208e6@willemb.c.googlers.com.notmuch>
In-Reply-To: <20230331160914.1608208-16-dhowells@redhat.com>
References: <20230331160914.1608208-1-dhowells@redhat.com>
	<20230331160914.1608208-16-dhowells@redhat.com>
Subject: RE: [PATCH v3 15/55] ip, udp: Support MSG_SPLICE_PAGES
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

David Howells wrote:
> Make IP/UDP sendmsg() support MSG_SPLICE_PAGES. This causes pages to be
> spliced from the source iterator.
> 
> This allows ->sendpage() to be replaced by something that can handle
> multiple multipage folios in a single transaction.
> 
> Signed-off-by: David Howells
> cc: Willem de Bruijn
> cc: "David S. Miller"
> cc: Eric Dumazet
> cc: Jakub Kicinski
> cc: Paolo Abeni
> cc: Jens Axboe
> cc: Matthew Wilcox
> cc: netdev@vger.kernel.org
> ---
>  net/ipv4/ip_output.c | 102 +++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 99 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index 4e4e308c3230..e2eaba817c1f 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -956,6 +956,79 @@ csum_page(struct page *page, int offset, int copy)
>  	return csum;
>  }
>  
> +/*
> + * Allocate a packet for MSG_SPLICE_PAGES.
> + */
> +static int __ip_splice_alloc(struct sock *sk, struct sk_buff **pskb,
> +			     unsigned int fragheaderlen, unsigned int maxfraglen,
> +			     unsigned int hh_len)
> +{
> +	struct sk_buff *skb_prev = *pskb, *skb;
> +	unsigned int fraggap = skb_prev->len - maxfraglen;
> +	unsigned int alloclen = fragheaderlen + hh_len + fraggap + 15;
> +
> +	skb = sock_wmalloc(sk, alloclen, 1, sk->sk_allocation);
> +	if (unlikely(!skb))
> +		return -ENOBUFS;
> +
> +	/* Fill in the control structures */
> +	skb->ip_summed = CHECKSUM_NONE;
> +	skb->csum = 0;
> +	skb_reserve(skb, hh_len);
> +
> +	/* Find where to start putting bytes. */
> +	skb_put(skb, fragheaderlen + fraggap);
> +	skb_reset_network_header(skb);
> +	skb->transport_header = skb->network_header + fragheaderlen;
> +	if (fraggap) {
> +		skb->csum = skb_copy_and_csum_bits(skb_prev, maxfraglen,
> +						   skb_transport_header(skb),
> +						   fraggap);
> +		skb_prev->csum = csum_sub(skb_prev->csum, skb->csum);
> +		pskb_trim_unique(skb_prev, maxfraglen);
> +	}
> +
> +	/* Put the packet on the pending queue. */
> +	__skb_queue_tail(&sk->sk_write_queue, skb);
> +	*pskb = skb;
> +	return 0;
> +}
> +
> +/*
> + * Add (or copy) data pages for MSG_SPLICE_PAGES.
> + */
> +static int __ip_splice_pages(struct sock *sk, struct sk_buff *skb,
> +			     void *from, int *pcopy)
> +{
> +	struct msghdr *msg = from;
> +	struct page *page = NULL, **pages = &page;
> +	ssize_t copy = *pcopy;
> +	size_t off;
> +	int err;
> +
> +	copy = iov_iter_extract_pages(&msg->msg_iter, &pages, copy, 1, 0, &off);
> +	if (copy <= 0)
> +		return copy ?: -EIO;
> +
> +	err = skb_append_pagefrags(skb, page, off, copy);
> +	if (err < 0) {
> +		iov_iter_revert(&msg->msg_iter, copy);
> +		return err;
> +	}
> +
> +	if (skb->ip_summed == CHECKSUM_NONE) {
> +		__wsum csum;
> +
> +		csum = csum_page(page, off, copy);
> +		skb->csum = csum_block_add(skb->csum, csum, skb->len);
> +	}
> +
> +	skb_len_add(skb, copy);
> +	refcount_add(copy, &sk->sk_wmem_alloc);
> +	*pcopy = copy;
> +	return 0;
> +}

These functions are derived from, and replace, ip_append_page. So
ip_append_page can be removed once udp_sendpage is converted? (A rough
sketch of what that conversion might look like is at the end of this mail.)

> 
>  static int __ip_append_data(struct sock *sk,
>  			    struct flowi4 *fl4,
>  			    struct sk_buff_head *queue,
> @@ -977,7 +1050,7 @@ static int __ip_append_data(struct sock *sk,
>  	int err;
>  	int offset = 0;
>  	bool zc = false;
> -	unsigned int maxfraglen, fragheaderlen, maxnonfragsize;
> +	unsigned int maxfraglen, fragheaderlen, maxnonfragsize, initial_length;
>  	int csummode = CHECKSUM_NONE;
>  	struct rtable *rt = (struct rtable *)cork->dst;
>  	unsigned int wmem_alloc_delta = 0;
> @@ -1017,6 +1090,7 @@ static int __ip_append_data(struct sock *sk,
>  	    (!exthdrlen || (rt->dst.dev->features & NETIF_F_HW_ESP_TX_CSUM)))
>  		csummode = CHECKSUM_PARTIAL;
>  
> +	initial_length = length;
>  	if ((flags & MSG_ZEROCOPY) && length) {
>  		struct msghdr *msg = from;
>  
> @@ -1047,6 +1121,14 @@ static int __ip_append_data(struct sock *sk,
>  				skb_zcopy_set(skb, uarg, &extra_uref);
>  			}
>  		}
> +	} else if ((flags & MSG_SPLICE_PAGES) && length) {
> +		if (inet->hdrincl)
> +			return -EPERM;
> +		if (rt->dst.dev->features & NETIF_F_SG)
> +			/* We need an empty buffer to attach stuff to */
> +			initial_length = transhdrlen;

I still don't entirely understand what initial_length means. More importantly,
transhdrlen can be zero: when this is called not for UDP but for RAW, or when
this is a subsequent call for a packet that is being held with MSG_MORE. Those
cases work fine today, as they go to alloc_new_skb. I'm not sure how this case
would be different, but the comment alludes that it is.
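To make the transhdrlen == 0 cases concrete, here is a purely illustrative
userspace sequence (my own example, not from this patch; the socket is assumed
to be a plain UDP socket and the helper name is made up). The second call
reaches __ip_append_data() with transhdrlen == 0, because ip_append_data()
zeroes transhdrlen whenever the write queue is already non-empty:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch: cork a UDP datagram with MSG_MORE, then append to it. The first
 * sendto() accounts for the UDP header (transhdrlen != 0); the second
 * sendto() appends to the pending packet, so __ip_append_data() runs with
 * transhdrlen == 0. RAW sockets pass transhdrlen == 0 on every call.
 */
static void corked_udp_send(int fd, const struct sockaddr_in *dst)
{
	char part1[256], part2[256];

	memset(part1, 'a', sizeof(part1));
	memset(part2, 'b', sizeof(part2));

	/* Hold the packet open: more data will follow. */
	sendto(fd, part1, sizeof(part1), MSG_MORE,
	       (const struct sockaddr *)dst, sizeof(*dst));

	/* Append to the held packet; no transport header is added here. */
	sendto(fd, part2, sizeof(part2), 0,
	       (const struct sockaddr *)dst, sizeof(*dst));
}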
> +		else
> +			flags &= ~MSG_SPLICE_PAGES;
>  	}
>  
>  	cork->length += length;
> @@ -1074,6 +1156,16 @@ static int __ip_append_data(struct sock *sk,
>  			unsigned int alloclen, alloc_extra;
>  			unsigned int pagedlen;
>  			struct sk_buff *skb_prev;
> +
> +			if (unlikely(flags & MSG_SPLICE_PAGES)) {
> +				err = __ip_splice_alloc(sk, &skb, fragheaderlen,
> +							maxfraglen, hh_len);
> +				if (err < 0)
> +					goto error;
> +				continue;
> +			}
> +			initial_length = length;
> +
>  alloc_new_skb:
>  			skb_prev = skb;
>  			if (skb_prev)
> @@ -1085,7 +1177,7 @@ static int __ip_append_data(struct sock *sk,
>  			 * If remaining data exceeds the mtu,
>  			 * we know we need more fragment(s).
>  			 */
> -			datalen = length + fraggap;
> +			datalen = initial_length + fraggap;
>  			if (datalen > mtu - fragheaderlen)
>  				datalen = maxfraglen - fragheaderlen;
>  			fraglen = datalen + fragheaderlen;
> @@ -1099,7 +1191,7 @@ static int __ip_append_data(struct sock *sk,
>  			 * because we have no idea what fragment will be
>  			 * the last.
>  			 */
> -			if (datalen == length + fraggap)
> +			if (datalen == initial_length + fraggap)
>  				alloc_extra += rt->dst.trailer_len;
>  
>  			if ((flags & MSG_MORE) &&
> @@ -1206,6 +1298,10 @@ static int __ip_append_data(struct sock *sk,
>  				err = -EFAULT;
>  				goto error;
>  			}
> +		} else if (flags & MSG_SPLICE_PAGES) {
> +			err = __ip_splice_pages(sk, skb, from, &copy);
> +			if (err < 0)
> +				goto error;
>  		} else if (!zc) {
>  			int i = skb_shinfo(skb)->nr_frags;
>  
> 
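As an aside, on the udp_sendpage() conversion mentioned above: below is a rough
sketch of what I would expect that conversion to look like. This is my own
illustration, not taken from this series, and it assumes bvec_set_page(),
iov_iter_bvec() and the MSG_SENDPAGE_NOTLAST-to-MSG_MORE translation are the
right tools here. With something like this in place, I believe ip_append_page()
would have no remaining callers and could be deleted.

/* Hypothetical replacement body for udp_sendpage() in net/ipv4/udp.c:
 * wrap the page in a one-element bvec iterator and hand it to sendmsg()
 * with MSG_SPLICE_PAGES, instead of calling ip_append_page().
 */
int udp_sendpage(struct sock *sk, struct page *page, int offset,
		 size_t size, int flags)
{
	struct bio_vec bvec;
	struct msghdr msg = { .msg_flags = flags | MSG_SPLICE_PAGES };

	/* sendpage callers signal "more to come" via MSG_SENDPAGE_NOTLAST. */
	if (flags & MSG_SENDPAGE_NOTLAST)
		msg.msg_flags |= MSG_MORE;

	bvec_set_page(&bvec, page, size, offset);
	iov_iter_bvec(&msg.msg_iter, ITER_SOURCE, &bvec, 1, size);

	return udp_sendmsg(sk, &msg, size);
}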