* [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
@ 2024-12-06 12:25 Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 01/10] mm: page_frag: some minor refactoring before adding new API Yunsheng Lin
                   ` (10 more replies)
  0 siblings, 11 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck, Shuah Khan,
	Andrew Morton, Linux-MM

This is part 2 of "Replace page_frag with page_frag_cache",
which introduces the new API and replaces page_frag with
page_frag_cache for sk_page_frag().

Part 1 of "Replace page_frag with page_frag_cache" is in [1].

After [2], there are still two implementations for page frag:

1. mm/page_alloc.c: net stack seems to be using it in the
   rx part with 'struct page_frag_cache' and the main API
   being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
   tx part with 'struct page_frag' and the main API being
   skb_page_frag_refill().

This patchset tries to unify the page frag implementation
by first replacing page_frag with page_frag_cache for
sk_page_frag(). net_high_order_alloc_disable_key for the
implementation in net/core/sock.c doesn't seem to matter that
much now, as pcp also supports high-order pages:
commit 44042b449872 ("mm/page_alloc: allow high-order pages to
be stored on the per-cpu lists")

As the change is mostly related to networking, this series
targets net-next. We will try to replace the rest of page_frag
in a follow-up patchset.

After this patchset:
1. The page frag implementation is unified by taking the best of
   the two existing implementations: we are able to save some space
   for the 'page_frag_cache' API user, and avoid 'get_page()' for
   the old 'page_frag' API user.
2. Future bug fixes and performance work can be done in one place,
   hence improving maintainability of page_frag's implementation.
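
To illustrate how a biased reference count avoids a per-fragment
'get_page()', here is a minimal userspace sketch. It is only a model
under stated assumptions: 'struct page_model', 'struct bias_cache_model',
MODEL_BIAS_MAX and all helper names are hypothetical stand-ins and not
the kernel API, and plain integers stand in for the atomic page refcount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical constant; stands in for PAGE_FRAG_CACHE_MAX_SIZE. */
#define MODEL_BIAS_MAX 32u

/* Hypothetical stand-in for a page with a (normally atomic) refcount. */
struct page_model {
	unsigned int refcount;
};

/* Hypothetical stand-in for the cache: current page plus local bias. */
struct bias_cache_model {
	struct page_model *page;
	unsigned int pagecnt_bias; /* references still owned by the cache */
};

/* Attach a fresh page and take all the references up front. */
static void bias_cache_refill(struct bias_cache_model *nc,
			      struct page_model *page)
{
	page->refcount = 1;               /* as if fresh from the allocator */
	page->refcount += MODEL_BIAS_MAX; /* one bulk update, not one per frag */
	nc->page = page;
	nc->pagecnt_bias = MODEL_BIAS_MAX + 1;
}

/* Hand one pre-taken reference to the caller; the page is not touched. */
static void bias_cache_take_frag(struct bias_cache_model *nc)
{
	assert(nc->pagecnt_bias);
	nc->pagecnt_bias--;
}

/* Free one fragment; returns 1 when the page would be freed. */
static int frag_put(struct page_model *page)
{
	assert(page->refcount);
	return --page->refcount == 0;
}

/* Give back the references the cache never handed out. */
static int bias_cache_drain(struct bias_cache_model *nc)
{
	struct page_model *page = nc->page;

	assert(page && page->refcount >= nc->pagecnt_bias);
	page->refcount -= nc->pagecnt_bias;
	nc->page = NULL;
	nc->pagecnt_bias = 0;
	return page->refcount == 0;
}
```

In this model the page refcount is only written on refill, drain and
fragment free; handing out a fragment touches nothing but the
cache-local bias.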

Performance validation for part 2:
1. Using the micro-benchmark ko added in patch 1 to test the aligned
   and non-aligned API performance impact for the existing users,
   there seems to be about 20% performance degradation from
   refactoring page_frag to support the new API, which seems to
   nullify most of the performance gain in [3] of part 1.
2. Using the netcat test case below, there seems to be some minor
   performance gain from replacing 'page_frag' with 'page_frag_cache'
   using the new page_frag API after this patchset.
   server: taskset -c 32 nc -l -k 1234 > /dev/null
   client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

To avoid performance noise as much as possible, the testing is done
on a system without any other load, with enough iterations to show
that the data is stable. The complete test log is below:

perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000
perf stat -r 200 -- insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1
taskset -c 32 nc -l -k 1234 > /dev/null
perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

*After* this patchset:

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.753187      task-clock (msec)         #    0.003 CPUs utilized            ( +-  0.44% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.30% )
                 0      cpu-migrations            #    0.003 K/sec                    ( +- 32.09% )
                84      page-faults               #    0.004 M/sec                    ( +-  0.08% )
          48700826      cycles                    #    2.597 GHz                      ( +-  0.44% )
          62086543      instructions              #    1.27  insn per cycle           ( +-  0.03% )
          14869358      branches                  #  792.898 M/sec                    ( +-  0.03% )
             19639      branch-misses             #    0.13% of all branches          ( +-  0.60% )

       7.035285915 seconds time elapsed                                          ( +-  0.06% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.442151      task-clock (msec)         #    0.006 CPUs utilized            ( +-  0.01% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.40% )
                 0      cpu-migrations            #    0.001 K/sec                    ( +- 57.44% )
                84      page-faults               #    0.005 M/sec                    ( +-  0.08% )
          47890149      cycles                    #    2.597 GHz                      ( +-  0.01% )
          60718325      instructions              #    1.27  insn per cycle           ( +-  0.00% )
          14570862      branches                  #  790.085 M/sec                    ( +-  0.00% )
             19613      branch-misses             #    0.13% of all branches          ( +-  0.12% )

       3.210892358 seconds time elapsed                                          ( +-  0.12% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      16824.017944      task-clock (msec)         #    0.621 CPUs utilized            ( +-  0.02% )
           2987954      context-switches          #    0.178 M/sec                    ( +-  0.04% )
                 1      cpu-migrations            #    0.000 K/sec
                93      page-faults               #    0.006 K/sec                    ( +-  0.09% )
       31982647267      cycles                    #    1.901 GHz                      ( +-  0.03% )
       38907812424      instructions              #    1.22  insn per cycle           ( +-  0.02% )
        7112328962      branches                  #  422.749 M/sec                    ( +-  0.03% )
          94789062      branch-misses             #    1.33% of all branches          ( +-  0.21% )

      27.104994660 seconds time elapsed                                          ( +-  0.03% )


*Before* this patchset:

Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000' (200 runs):

         18.700051      task-clock (msec)         #    0.003 CPUs utilized            ( +-  1.04% )
                 8      context-switches          #    0.420 K/sec                    ( +-  0.31% )
                 0      cpu-migrations            #    0.019 K/sec                    ( +- 10.16% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.09% )
          48548980      cycles                    #    2.596 GHz                      ( +-  1.04% )
          61857980      instructions              #    1.27  insn per cycle           ( +-  0.09% )
          14814201      branches                  #  792.201 M/sec                    ( +-  0.08% )
             42007      branch-misses             #    0.28% of all branches          ( +-  0.11% )

       5.565806266 seconds time elapsed                                          ( +-  0.08% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=16 test_pop_cpu=17 test_alloc_len=12 nr_test=51200000 test_align=1' (200 runs):

         18.468618      task-clock (msec)         #    0.007 CPUs utilized            ( +-  1.14% )
                 8      context-switches          #    0.422 K/sec                    ( +-  0.43% )
                 0      cpu-migrations            #    0.026 K/sec                    ( +-  7.89% )
                81      page-faults               #    0.004 M/sec                    ( +-  0.08% )
          47950150      cycles                    #    2.596 GHz                      ( +-  1.14% )
          61745530      instructions              #    1.29  insn per cycle           ( +-  0.09% )
          14787783      branches                  #  800.698 M/sec                    ( +-  0.08% )
             41734      branch-misses             #    0.28% of all branches          ( +-  0.09% )

       2.584180919 seconds time elapsed                                          ( +-  0.04% )

 Performance counter stats for 'taskset -c 0 head -c 20G /dev/zero' (200 runs):

      17105.617450      task-clock (msec)         #    0.599 CPUs utilized            ( +-  0.02% )
           2822654      context-switches          #    0.165 M/sec                    ( +-  0.03% )
                 1      cpu-migrations            #    0.000 K/sec                    ( +-  0.50% )
                93      page-faults               #    0.005 K/sec                    ( +-  0.09% )
       31819244033      cycles                    #    1.860 GHz                      ( +-  0.03% )
       37297412811      instructions              #    1.17  insn per cycle           ( +-  0.01% )
        6676699757      branches                  #  390.322 M/sec                    ( +-  0.01% )
         325102016      branch-misses             #    4.87% of all branches          ( +-  0.06% )

      28.568053622 seconds time elapsed                                          ( +-  0.02% )
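
For reference, the relative change in elapsed time can be derived from
the logs above with a sketch like the one below. The numbers are copied
from the 'seconds time elapsed' lines; the struct and helper names are
made up for illustration. Elapsed time is only one possible metric and
is not necessarily the one behind the "about 20%" figure quoted above.

```c
#include <assert.h>
#include <stdio.h>

/* Elapsed seconds copied from the 'seconds time elapsed' log lines. */
struct elapsed_sample {
	const char *name;
	double before; /* before this patchset */
	double after;  /* after this patchset */
};

static const struct elapsed_sample samples[] = {
	{ "page_frag_test, non-aligned",  5.565806266,  7.035285915 },
	{ "page_frag_test, aligned",      2.584180919,  3.210892358 },
	{ "netcat, ipv4-tcp",            28.568053622, 27.104994660 },
};

/* Relative change of 'after' versus 'before', in percent. */
static double rel_change_pct(double before, double after)
{
	assert(before > 0.0);
	return (after - before) / before * 100.0;
}

/* Print one line per test case; positive means slower after the patchset. */
static void print_elapsed_summary(void)
{
	size_t i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%-28s %+.1f%%\n", samples[i].name,
		       rel_change_pct(samples[i].before, samples[i].after));
}
```

Taking the elapsed times at face value, this gives roughly +26% and
+24% for the two micro-benchmark runs and roughly -5% for the netcat
run.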

Note, ipv4-udp, ipv6-tcp and ipv6-udp are also tested with the scripts below:
nc -u -l -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u 127.0.0.1 1234

nc -l6 -k 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc ::1 1234

nc -l6 -k -u 1234 > /dev/null
perf stat -r 4 -- head -c 51200000000 /dev/zero | nc -u ::1 1234

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Shuah Khan <skhan@linuxfoundation.org>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>

1. https://lore.kernel.org/all/20241028115343.3405838-1-linyunsheng@huawei.com/
2. https://lore.kernel.org/all/20240228093013.8263-1-linyunsheng@huawei.com/
3. https://lore.kernel.org/all/472a7a09-387f-480d-b66c-761e0b6192ef@huawei.com/

V2: Repost based on the latest net-next.

V1: Rebase on latest net-next tree and redo the performance test.

RFC:
    1. CC Andrew and MM ML explicitly.
    2. Split into two parts according to the discussion in v22, and this is
       the part-2.
    3. Split the 'introduce new API' patch into more patches to make it more
       reviewable and easier to discuss.

Yunsheng Lin (10):
  mm: page_frag: some minor refactoring before adding new API
  net: rename skb_copy_to_page_nocache() helper
  mm: page_frag: update documentation for page_frag
  mm: page_frag: introduce page_frag_alloc_abort() related API
  mm: page_frag: introduce refill prepare & commit API
  mm: page_frag: introduce alloc_refill prepare & commit API
  mm: page_frag: introduce probe related API
  mm: page_frag: add testing for the newly added API
  net: replace page_frag with page_frag_cache
  mm: page_frag: add an entry in MAINTAINERS for page_frag

 Documentation/mm/page_frags.rst               | 207 ++++++++++-
 MAINTAINERS                                   |  12 +
 .../chelsio/inline_crypto/chtls/chtls.h       |   3 -
 .../chelsio/inline_crypto/chtls/chtls_io.c    | 101 ++----
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   3 -
 drivers/net/tun.c                             |  47 ++-
 include/linux/page_frag_cache.h               | 330 +++++++++++++++++-
 include/linux/sched.h                         |   2 +-
 include/net/sock.h                            |  30 +-
 kernel/exit.c                                 |   3 +-
 kernel/fork.c                                 |   3 +-
 mm/page_frag_cache.c                          | 108 +++++-
 net/core/skbuff.c                             |  58 +--
 net/core/skmsg.c                              |  12 +-
 net/core/sock.c                               |  32 +-
 net/ipv4/ip_output.c                          |  28 +-
 net/ipv4/tcp.c                                |  26 +-
 net/ipv4/tcp_output.c                         |  25 +-
 net/ipv6/ip6_output.c                         |  28 +-
 net/kcm/kcmsock.c                             |  21 +-
 net/mptcp/protocol.c                          |  47 ++-
 net/tls/tls_device.c                          | 100 ++++--
 .../selftests/mm/page_frag/page_frag_test.c   |  76 +++-
 tools/testing/selftests/mm/run_vmtests.sh     |   4 +
 tools/testing/selftests/mm/test_page_frag.sh  |  27 ++
 25 files changed, 1045 insertions(+), 288 deletions(-)

-- 
2.33.0




* [PATCH net-next v2 01/10] mm: page_frag: some minor refactoring before adding new API
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 02/10] net: rename skb_copy_to_page_nocache() helper Yunsheng Lin
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM

Refactor the common code from __page_frag_alloc_align() into
__page_frag_cache_prepare() and __page_frag_cache_commit(),
so that the new API can make use of them.
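
The prepare/commit split can be sketched in userspace as below. This
is a simplified model, not the kernel code: 'struct frag_cache_model',
'struct frag_model', MODEL_CACHE_SIZE and the model_*() helpers are
hypothetical names, a fixed buffer stands in for the cached page, and
the refill path is reduced to returning NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical size of the modelled cache buffer. */
#define MODEL_CACHE_SIZE 4096u

struct frag_cache_model {
	unsigned char buf[MODEL_CACHE_SIZE]; /* stands in for the cached page */
	unsigned int offset;                 /* next free byte */
	unsigned int pagecnt_bias;           /* references owned by the cache */
};

struct frag_model {
	unsigned int offset; /* aligned start of the prepared fragment */
	unsigned int size;   /* bytes available from offset to buffer end */
};

static void model_init(struct frag_cache_model *nc)
{
	nc->offset = 0;
	nc->pagecnt_bias = MODEL_CACHE_SIZE + 1;
}

/* Prepare: describe the fragment; in this model the cache is not modified. */
static void *model_prepare(struct frag_cache_model *nc, unsigned int fragsz,
			   struct frag_model *pfrag, unsigned int align)
{
	/* align is assumed to be a power of two */
	unsigned int offset = (nc->offset + align - 1) & ~(align - 1);

	if (offset > MODEL_CACHE_SIZE || fragsz > MODEL_CACHE_SIZE - offset)
		return NULL; /* the real code would try to refill here */

	pfrag->offset = offset;
	pfrag->size = MODEL_CACHE_SIZE - offset;
	return nc->buf + offset;
}

/* Commit without consuming a reference: only advance the offset. */
static unsigned int model_commit_noref(struct frag_cache_model *nc,
				       const struct frag_model *pfrag,
				       unsigned int used_sz)
{
	unsigned int orig_offset = nc->offset;

	assert(used_sz <= pfrag->size);
	assert(nc->offset <= pfrag->offset);

	nc->offset = pfrag->offset + used_sz;
	/* true size consumed, including any alignment padding */
	return nc->offset - orig_offset;
}

/* Commit: hand one reference to the caller, then advance the offset. */
static unsigned int model_commit(struct frag_cache_model *nc,
				 const struct frag_model *pfrag,
				 unsigned int used_sz)
{
	assert(nc->pagecnt_bias);
	nc->pagecnt_bias--;
	return model_commit_noref(nc, pfrag, used_sz);
}

/* The existing allocation API becomes prepare followed by commit. */
static void *model_alloc_align(struct frag_cache_model *nc,
			       unsigned int fragsz, unsigned int align)
{
	struct frag_model frag;
	void *va = model_prepare(nc, fragsz, &frag, align);

	if (va)
		model_commit(nc, &frag, fragsz);
	return va;
}
```

A caller that does not know the final size in advance can call the
prepare step, fill in the buffer, and then commit only the bytes it
actually used.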

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 include/linux/page_frag_cache.h | 34 ++++++++++++++++++++++++++--
 mm/page_frag_cache.c            | 40 ++++++++++++++++++++++++++-------
 2 files changed, 64 insertions(+), 10 deletions(-)

diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 41a91df82631..5ae97f93a0a1 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -5,6 +5,7 @@
 
 #include <linux/bits.h>
 #include <linux/log2.h>
+#include <linux/mmdebug.h>
 #include <linux/mm_types_task.h>
 #include <linux/types.h>
 
@@ -39,8 +40,37 @@ static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
 
 void page_frag_cache_drain(struct page_frag_cache *nc);
 void __page_frag_cache_drain(struct page *page, unsigned int count);
-void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
-			      gfp_t gfp_mask, unsigned int align_mask);
+void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
+				struct page_frag *pfrag, gfp_t gfp_mask,
+				unsigned int align_mask);
+unsigned int __page_frag_cache_commit_noref(struct page_frag_cache *nc,
+					    struct page_frag *pfrag,
+					    unsigned int used_sz);
+
+static inline unsigned int __page_frag_cache_commit(struct page_frag_cache *nc,
+						    struct page_frag *pfrag,
+						    unsigned int used_sz)
+{
+	VM_BUG_ON(!nc->pagecnt_bias);
+	nc->pagecnt_bias--;
+
+	return __page_frag_cache_commit_noref(nc, pfrag, used_sz);
+}
+
+static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
+					    unsigned int fragsz, gfp_t gfp_mask,
+					    unsigned int align_mask)
+{
+	struct page_frag page_frag;
+	void *va;
+
+	va = __page_frag_cache_prepare(nc, fragsz, &page_frag, gfp_mask,
+				       align_mask);
+	if (likely(va))
+		__page_frag_cache_commit(nc, &page_frag, fragsz);
+
+	return va;
+}
 
 static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 					  unsigned int fragsz, gfp_t gfp_mask,
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 3f7a203d35c6..f55d34cf7d43 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -90,9 +90,31 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);
 
-void *__page_frag_alloc_align(struct page_frag_cache *nc,
-			      unsigned int fragsz, gfp_t gfp_mask,
-			      unsigned int align_mask)
+unsigned int __page_frag_cache_commit_noref(struct page_frag_cache *nc,
+					    struct page_frag *pfrag,
+					    unsigned int used_sz)
+{
+	unsigned int orig_offset;
+
+	VM_BUG_ON(used_sz > pfrag->size);
+	VM_BUG_ON(pfrag->page != encoded_page_decode_page(nc->encoded_page));
+	VM_BUG_ON(pfrag->offset + pfrag->size >
+		  (PAGE_SIZE << encoded_page_decode_order(nc->encoded_page)));
+
+	/* pfrag->offset might be bigger than the nc->offset due to alignment */
+	VM_BUG_ON(nc->offset > pfrag->offset);
+
+	orig_offset = nc->offset;
+	nc->offset = pfrag->offset + used_sz;
+
+	/* Return true size back to caller considering the offset alignment */
+	return nc->offset - orig_offset;
+}
+EXPORT_SYMBOL(__page_frag_cache_commit_noref);
+
+void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
+				struct page_frag *pfrag, gfp_t gfp_mask,
+				unsigned int align_mask)
 {
 	unsigned long encoded_page = nc->encoded_page;
 	unsigned int size, offset;
@@ -114,6 +136,8 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
 		nc->offset = 0;
+	} else {
+		page = encoded_page_decode_page(encoded_page);
 	}
 
 	size = PAGE_SIZE << encoded_page_decode_order(encoded_page);
@@ -132,8 +156,6 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 			return NULL;
 		}
 
-		page = encoded_page_decode_page(encoded_page);
-
 		if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
 			goto refill;
 
@@ -148,15 +170,17 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 
 		/* reset page count bias and offset to start of new frag */
 		nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
+		nc->offset = 0;
 		offset = 0;
 	}
 
-	nc->pagecnt_bias--;
-	nc->offset = offset + fragsz;
+	pfrag->page = page;
+	pfrag->offset = offset;
+	pfrag->size = size - offset;
 
 	return encoded_page_decode_virt(encoded_page) + offset;
 }
-EXPORT_SYMBOL(__page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_cache_prepare);
 
 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.
-- 
2.33.0




* [PATCH net-next v2 02/10] net: rename skb_copy_to_page_nocache() helper
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 01/10] mm: page_frag: some minor refactoring before adding new API Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 03/10] mm: page_frag: update documentation for page_frag Yunsheng Lin
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Alexander Duyck, Eric Dumazet,
	Simon Horman, David Ahern

Rename skb_copy_to_page_nocache() to skb_copy_to_frag_nocache()
and have it take a virtual address instead of a page and offset,
to avoid calling virt_to_page() as we are about to pass the
virtual address directly.

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
---
 include/net/sock.h | 9 ++++-----
 net/ipv4/tcp.c     | 7 +++----
 net/kcm/kcmsock.c  | 7 +++----
 3 files changed, 10 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 7464e9f9f47c..cf037c870e3b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2203,15 +2203,14 @@ static inline int skb_add_data_nocache(struct sock *sk, struct sk_buff *skb,
 	return err;
 }
 
-static inline int skb_copy_to_page_nocache(struct sock *sk, struct iov_iter *from,
+static inline int skb_copy_to_frag_nocache(struct sock *sk,
+					   struct iov_iter *from,
 					   struct sk_buff *skb,
-					   struct page *page,
-					   int off, int copy)
+					   char *va, int copy)
 {
 	int err;
 
-	err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) + off,
-				       copy, skb->len);
+	err = skb_do_copy_data_nocache(sk, skb, from, va, copy, skb->len);
 	if (err)
 		return err;
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0d704bda6c41..0fbf1e222cda 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1219,10 +1219,9 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 			if (!copy)
 				goto wait_for_space;
 
-			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
-						       pfrag->page,
-						       pfrag->offset,
-						       copy);
+			err = skb_copy_to_frag_nocache(sk, &msg->msg_iter, skb,
+						       page_address(pfrag->page) +
+						       pfrag->offset, copy);
 			if (err)
 				goto do_error;
 
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 24aec295a51c..94719d4af5fa 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -856,10 +856,9 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 			if (!sk_wmem_schedule(sk, copy))
 				goto wait_for_memory;
 
-			err = skb_copy_to_page_nocache(sk, &msg->msg_iter, skb,
-						       pfrag->page,
-						       pfrag->offset,
-						       copy);
+			err = skb_copy_to_frag_nocache(sk, &msg->msg_iter, skb,
+						       page_address(pfrag->page) +
+						       pfrag->offset, copy);
 			if (err)
 				goto out_error;
 
-- 
2.33.0




* [PATCH net-next v2 03/10] mm: page_frag: update documentation for page_frag
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 01/10] mm: page_frag: some minor refactoring before adding new API Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 02/10] net: rename skb_copy_to_page_nocache() helper Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 04/10] mm: page_frag: introduce page_frag_alloc_abort() related API Yunsheng Lin
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Jonathan Corbet, linux-doc

Update documentation about design, implementation and API usages
for page_frag.

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 Documentation/mm/page_frags.rst | 110 +++++++++++++++++++++++++++++++-
 include/linux/page_frag_cache.h |  54 ++++++++++++++++
 mm/page_frag_cache.c            |  12 +++-
 3 files changed, 173 insertions(+), 3 deletions(-)

diff --git a/Documentation/mm/page_frags.rst b/Documentation/mm/page_frags.rst
index 503ca6cdb804..34e654c2956e 100644
--- a/Documentation/mm/page_frags.rst
+++ b/Documentation/mm/page_frags.rst
@@ -1,3 +1,5 @@
+.. SPDX-License-Identifier: GPL-2.0
+
 ==============
 Page fragments
 ==============
@@ -40,4 +42,110 @@ page via a single call.  The advantage to doing this is that it allows for
 cleaning up the multiple references that were added to a page in order to
 avoid calling get_page per allocation.
 
-Alexander Duyck, Nov 29, 2016.
+
+Architecture overview
+=====================
+
+.. code-block:: none
+
+                      +----------------------+
+                      | page_frag API caller |
+                      +----------------------+
+                                  |
+                                  |
+                                  v
+    +------------------------------------------------------------------+
+    |                   request page fragment                          |
+    +------------------------------------------------------------------+
+             |                                 |                     |
+             |                                 |                     |
+             |                          Cache not enough             |
+             |                                 |                     |
+             |                         +-----------------+           |
+             |                         | reuse old cache |--Usable-->|
+             |                         +-----------------+           |
+             |                                 |                     |
+             |                             Not usable                |
+             |                                 |                     |
+             |                                 v                     |
+        Cache empty                   +-----------------+            |
+             |                        | drain old cache |            |
+             |                        +-----------------+            |
+             |                                 |                     |
+             v_________________________________v                     |
+                              |                                      |
+                              |                                      |
+             _________________v_______________                       |
+            |                                 |              Cache is enough
+            |                                 |                      |
+ PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE         |                      |
+            |                                 |                      |
+            |               PAGE_SIZE >= PAGE_FRAG_CACHE_MAX_SIZE    |
+            v                                 |                      |
+    +----------------------------------+      |                      |
+    | refill cache with order > 0 page |      |                      |
+    +----------------------------------+      |                      |
+      |                    |                  |                      |
+      |                    |                  |                      |
+      |              Refill failed            |                      |
+      |                    |                  |                      |
+      |                    v                  v                      |
+      |      +------------------------------------+                  |
+      |      |   refill cache with order 0 page   |                  |
+      |      +----------------------------------=-+                  |
+      |                       |                                      |
+ Refill succeed               |                                      |
+      |                 Refill succeed                               |
+      |                       |                                      |
+      v                       v                                      v
+    +------------------------------------------------------------------+
+    |             allocate fragment from cache                         |
+    +------------------------------------------------------------------+
+
+API interface
+=============
+
+Depending on the alignment requirement, the page_frag API caller may call
+page_frag_*_align*() to ensure the returned virtual address or offset of the
+page is aligned according to the 'align/alignment' parameter. Note that the size
+of the allocated fragment is not aligned; the caller needs to provide an aligned
+fragsz if there is an alignment requirement for the size of the fragment.
+
+.. kernel-doc:: include/linux/page_frag_cache.h
+   :identifiers: page_frag_cache_init page_frag_cache_is_pfmemalloc
+		 __page_frag_alloc_align page_frag_alloc_align page_frag_alloc
+
+.. kernel-doc:: mm/page_frag_cache.c
+   :identifiers: page_frag_cache_drain page_frag_free
+
+Coding examples
+===============
+
+Initialization and draining API
+-------------------------------
+
+.. code-block:: c
+
+   page_frag_cache_init(nc);
+   ...
+   page_frag_cache_drain(nc);
+
+
+Allocation & freeing API
+------------------------
+
+.. code-block:: c
+
+    void *va;
+
+    va = page_frag_alloc_align(nc, size, gfp, align);
+    if (!va)
+        goto do_error;
+
+    err = do_something(va, size);
+    if (err)
+        goto do_error;
+
+    ...
+
+    page_frag_free(va);
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 5ae97f93a0a1..a2b1127e8ac8 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -28,11 +28,28 @@ static inline bool encoded_page_decode_pfmemalloc(unsigned long encoded_page)
 	return !!(encoded_page & PAGE_FRAG_CACHE_PFMEMALLOC_BIT);
 }
 
+/**
+ * page_frag_cache_init() - Init page_frag cache.
+ * @nc: page_frag cache from which to init
+ *
+ * Inline helper to initialize the page_frag cache.
+ */
 static inline void page_frag_cache_init(struct page_frag_cache *nc)
 {
 	nc->encoded_page = 0;
 }
 
+/**
+ * page_frag_cache_is_pfmemalloc() - Check for pfmemalloc.
+ * @nc: page_frag cache from which to check
+ *
+ * Check if the current page in page_frag cache is allocated from the pfmemalloc
+ * reserves. It has the same calling context expectation as the allocation API.
+ *
+ * Return:
+ * true if the current page in page_frag cache is allocated from the pfmemalloc
+ * reserves, otherwise return false.
+ */
 static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
 {
 	return encoded_page_decode_pfmemalloc(nc->encoded_page);
@@ -57,6 +74,19 @@ static inline unsigned int __page_frag_cache_commit(struct page_frag_cache *nc,
 	return __page_frag_cache_commit_noref(nc, pfrag, used_sz);
 }
 
+/**
+ * __page_frag_alloc_align() - Allocate a page fragment with aligning
+ * requirement.
+ * @nc: page_frag cache from which to allocate
+ * @fragsz: the requested fragment size
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ * @align_mask: the requested aligning requirement for the 'va'
+ *
+ * Allocate a page fragment from page_frag cache with aligning requirement.
+ *
+ * Return:
+ * Virtual address of the page fragment, otherwise return NULL.
+ */
 static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
 					    unsigned int fragsz, gfp_t gfp_mask,
 					    unsigned int align_mask)
@@ -72,6 +102,19 @@ static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
 	return va;
 }
 
+/**
+ * page_frag_alloc_align() - Allocate a page fragment with aligning requirement.
+ * @nc: page_frag cache from which to allocate
+ * @fragsz: the requested fragment size
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ * @align: the requested aligning requirement for the fragment
+ *
+ * WARN_ON_ONCE() checking for @align before allocating a page fragment from
+ * page_frag cache with aligning requirement.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
 static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 					  unsigned int fragsz, gfp_t gfp_mask,
 					  unsigned int align)
@@ -80,6 +123,17 @@ static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 	return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
 }
 
+/**
+ * page_frag_alloc() - Allocate a page fragment.
+ * @nc: page_frag cache from which to allocate
+ * @fragsz: the requested fragment size
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ *
+ * Allocate a page fragment from page_frag cache.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
 static inline void *page_frag_alloc(struct page_frag_cache *nc,
 				    unsigned int fragsz, gfp_t gfp_mask)
 {
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index f55d34cf7d43..d014130fb893 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -70,6 +70,10 @@ static struct page *__page_frag_cache_refill(struct page_frag_cache *nc,
 	return page;
 }
 
+/**
+ * page_frag_cache_drain - Drain the current page from page_frag cache.
+ * @nc: page_frag cache from which to drain
+ */
 void page_frag_cache_drain(struct page_frag_cache *nc)
 {
 	if (!nc->encoded_page)
@@ -182,8 +186,12 @@ void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
 }
 EXPORT_SYMBOL(__page_frag_cache_prepare);
 
-/*
- * Frees a page fragment allocated out of either a compound or order 0 page.
+/**
+ * page_frag_free - Free a page fragment.
+ * @addr: va of page fragment to be freed
+ *
+ * Free a page fragment allocated out of either a compound or order 0 page by
+ * virtual address.
  */
 void page_frag_free(void *addr)
 {
-- 
2.33.0




* [PATCH net-next v2 04/10] mm: page_frag: introduce page_frag_alloc_abort() related API
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (2 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 03/10] mm: page_frag: update documentation for page_frag Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 05/10] mm: page_frag: introduce refill prepare & commit API Yunsheng Lin
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Jonathan Corbet, linux-doc

Some callers, such as tun_build_skb(), do not need the more
complicated prepare & commit API. For their error handling, add
an abort API that undoes a page_frag_alloc_*() call when the
caller knows that no one else holds an extra reference to the
just-allocated fragment. Also add an abort_ref API that only
undoes the reference counting of the allocated fragment, for the
case where the fragment is already referenced by someone else.
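
As a rough illustration of the difference between the two abort
flavours, here is a minimal userspace sketch. It is not the kernel
implementation: struct abort_cache_model and the abort_model_*()
helpers are hypothetical stand-ins for 'struct page_frag_cache' and
the alloc/abort API, and a plain buffer plays the role of the cached
page.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for 'struct page_frag_cache'. */
struct abort_cache_model {
	unsigned char buf[4096];	/* plays the role of the cached page */
	unsigned int offset;		/* next free byte in buf */
	unsigned int pagecnt_bias;	/* references still owned by the cache */
};

static void abort_model_init(struct abort_cache_model *nc)
{
	nc->offset = 0;
	nc->pagecnt_bias = 4096;
}

/* Hand out fragsz bytes and give one reference to the caller. */
static void *abort_model_alloc(struct abort_cache_model *nc,
			       unsigned int fragsz)
{
	void *va;

	if (nc->offset + fragsz > sizeof(nc->buf))
		return NULL;

	va = nc->buf + nc->offset;
	nc->offset += fragsz;
	nc->pagecnt_bias--;
	return va;
}

/* Take back only the reference; the bytes stay consumed. */
static void abort_model_abort_ref(struct abort_cache_model *nc, void *va,
				  unsigned int fragsz)
{
	/* Only the most recently allocated fragment may be aborted. */
	assert((unsigned char *)va + fragsz == nc->buf + nc->offset);
	nc->pagecnt_bias++;
}

/* Take back the reference and the bytes, so the space can be reused. */
static void abort_model_abort(struct abort_cache_model *nc, void *va,
			      unsigned int fragsz)
{
	abort_model_abort_ref(nc, va, fragsz);
	nc->offset -= fragsz;
}
```

In this model, an allocation made right after abort_model_abort()
returns the same address again, while abort_model_abort_ref() leaves
the offset where it was and only returns the reference to the cache.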

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 Documentation/mm/page_frags.rst |  7 +++++--
 include/linux/page_frag_cache.h | 20 ++++++++++++++++++++
 mm/page_frag_cache.c            | 21 +++++++++++++++++++++
 3 files changed, 46 insertions(+), 2 deletions(-)

diff --git a/Documentation/mm/page_frags.rst b/Documentation/mm/page_frags.rst
index 34e654c2956e..339e641beb53 100644
--- a/Documentation/mm/page_frags.rst
+++ b/Documentation/mm/page_frags.rst
@@ -114,9 +114,10 @@ fragsz if there is an alignment requirement for the size of the fragment.
 .. kernel-doc:: include/linux/page_frag_cache.h
    :identifiers: page_frag_cache_init page_frag_cache_is_pfmemalloc
 		 __page_frag_alloc_align page_frag_alloc_align page_frag_alloc
+		 page_frag_alloc_abort
 
 .. kernel-doc:: mm/page_frag_cache.c
-   :identifiers: page_frag_cache_drain page_frag_free
+   :identifiers: page_frag_cache_drain page_frag_free page_frag_alloc_abort_ref
 
 Coding examples
 ===============
@@ -143,8 +144,10 @@ Allocation & freeing API
         goto do_error;
 
     err = do_something(va, size);
-    if (err)
+    if (err) {
+        page_frag_alloc_abort(nc, va, size);
         goto do_error;
+    }
 
     ...
 
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index a2b1127e8ac8..c3347c97522c 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -141,5 +141,25 @@ static inline void *page_frag_alloc(struct page_frag_cache *nc,
 }
 
 void page_frag_free(void *addr);
+void page_frag_alloc_abort_ref(struct page_frag_cache *nc, void *va,
+			       unsigned int fragsz);
+
+/**
+ * page_frag_alloc_abort - Abort the page fragment allocation.
+ * @nc: page_frag cache to which the page fragment is aborted back
+ * @va: virtual address of page fragment to be aborted
+ * @fragsz: size of the page fragment to be aborted
+ *
+ * It is expected to be called from the same context as the allocation API.
+ * Mostly used for error handling cases to abort the fragment allocation knowing
+ * that no one else is taking extra reference to the just aborted fragment, so
+ * that the aborted fragment can be reused.
+ */
+static inline void page_frag_alloc_abort(struct page_frag_cache *nc, void *va,
+					 unsigned int fragsz)
+{
+	page_frag_alloc_abort_ref(nc, va, fragsz);
+	nc->offset -= fragsz;
+}
 
 #endif
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index d014130fb893..8c3cfdbe8c2b 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -201,3 +201,24 @@ void page_frag_free(void *addr)
 		free_unref_page(page, compound_order(page));
 }
 EXPORT_SYMBOL(page_frag_free);
+
+/**
+ * page_frag_alloc_abort_ref - Abort the reference of allocated fragment.
+ * @nc: page_frag cache to which the page fragment is aborted back
+ * @va: virtual address of page fragment to be aborted
+ * @fragsz: size of the page fragment to be aborted
+ *
+ * It is expected to be called from the same context as the allocation API.
+ * Mostly used for error handling cases to abort the reference of allocated
+ * fragment if the fragment has been referenced for other usages, to avoid the
+ * atomic operation of page_frag_free() API.
+ */
+void page_frag_alloc_abort_ref(struct page_frag_cache *nc, void *va,
+			       unsigned int fragsz)
+{
+	VM_BUG_ON(va + fragsz !=
+		  encoded_page_decode_virt(nc->encoded_page) + nc->offset);
+
+	nc->pagecnt_bias++;
+}
+EXPORT_SYMBOL(page_frag_alloc_abort_ref);
-- 
2.33.0




* [PATCH net-next v2 05/10] mm: page_frag: introduce refill prepare & commit API
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (3 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 04/10] mm: page_frag: introduce page_frag_alloc_abort() related API Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 06/10] mm: page_frag: introduce alloc_refill " Yunsheng Lin
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Jonathan Corbet, linux-doc

Currently page_frag only has an alloc API, which returns the
virtual address of a fragment of a specific size.

There are many use cases that need a minimum amount of memory
to make forward progress, but perform better if more memory is
available, and that expect to use the 'struct page' of the
allocated fragment directly instead of the virtual address.

Currently the skb_page_frag_refill() API is used for the above
use cases, but the caller needs to know about the internal
details and access the data fields of 'struct page_frag' to
meet those requirements, and its implementation is similar to
the one in the mm subsystem.

To unify those two page_frag implementations, introduce a
prepare API that ensures the minimum memory is available and
reports how much memory is actually available to the caller.
The caller then either calls the commit API to report how much
memory it actually used, or skips the commit if it decides not
to use any memory.
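
The prepare/commit contract can be sketched in userspace as follows.
This is only a simplified model under stated assumptions, not the
kernel code: struct refill_cache_model, struct refill_frag_model and
the refill_model_*() helpers are hypothetical names, the cached page
has a fixed size, and a failed prepare simply returns false where the
real API would try to refill the cache with a new page.

```c
#include <assert.h>
#include <stdbool.h>

#define REFILL_MODEL_SIZE 4096u

/* Hypothetical stand-in for 'struct page_frag_cache'. */
struct refill_cache_model {
	unsigned int offset;		/* next free byte in the cached page */
	unsigned int pagecnt_bias;	/* references still owned by the cache */
};

/* Hypothetical stand-in for the offset/size part of 'struct page_frag'. */
struct refill_frag_model {
	unsigned int offset;
	unsigned int size;
};

/*
 * Succeed only if at least fragsz bytes are left, and report everything
 * that is left. Nothing is consumed until a commit happens.
 */
static bool refill_model_prepare(const struct refill_cache_model *nc,
				 unsigned int fragsz,
				 struct refill_frag_model *pfrag)
{
	if (nc->offset + fragsz > REFILL_MODEL_SIZE)
		return false;

	pfrag->offset = nc->offset;
	pfrag->size = REFILL_MODEL_SIZE - nc->offset;
	return true;
}

/* Consume used_sz bytes without handing out a reference. */
static unsigned int
refill_model_commit_noref(struct refill_cache_model *nc,
			  const struct refill_frag_model *pfrag,
			  unsigned int used_sz)
{
	unsigned int orig_offset = nc->offset;

	assert(used_sz <= pfrag->size);
	nc->offset = pfrag->offset + used_sz;
	return nc->offset - orig_offset;
}

/* Consume used_sz bytes and hand one reference to the caller. */
static unsigned int
refill_model_commit(struct refill_cache_model *nc,
		    const struct refill_frag_model *pfrag,
		    unsigned int used_sz)
{
	nc->pagecnt_bias--;
	return refill_model_commit_noref(nc, pfrag, used_sz);
}
```

The caller asks for its minimum in refill_model_prepare(), sizes its
work against the reported pfrag->size, and commits only what it used.
Skipping the commit leaves the model cache untouched.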

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 Documentation/mm/page_frags.rst |  43 ++++++++++++-
 include/linux/page_frag_cache.h | 110 ++++++++++++++++++++++++++++++++
 2 files changed, 152 insertions(+), 1 deletion(-)

diff --git a/Documentation/mm/page_frags.rst b/Documentation/mm/page_frags.rst
index 339e641beb53..4cfdbe7db55a 100644
--- a/Documentation/mm/page_frags.rst
+++ b/Documentation/mm/page_frags.rst
@@ -111,10 +111,18 @@ page is aligned according to the 'align/alignment' parameter. Note the size of
 the allocated fragment is not aligned, the caller needs to provide an aligned
 fragsz if there is an alignment requirement for the size of the fragment.
 
+There is a use case that needs minimum memory in order for forward progress, but
+more performant if more memory is available. By using the prepare and commit
+related API, the caller calls prepare API to requests the minimum memory it
+needs and prepare API will return the maximum size of the fragment returned. The
+caller needs to either call the commit API to report how much memory it actually
+uses, or not do so if deciding to not use any memory.
+
 .. kernel-doc:: include/linux/page_frag_cache.h
    :identifiers: page_frag_cache_init page_frag_cache_is_pfmemalloc
 		 __page_frag_alloc_align page_frag_alloc_align page_frag_alloc
-		 page_frag_alloc_abort
+		 page_frag_alloc_abort __page_frag_refill_prepare_align
+		 page_frag_refill_prepare_align page_frag_refill_prepare
 
 .. kernel-doc:: mm/page_frag_cache.c
    :identifiers: page_frag_cache_drain page_frag_free page_frag_alloc_abort_ref
@@ -152,3 +160,36 @@ Allocation & freeing API
     ...
 
     page_frag_free(va);
+
+
+Refill Preparation & committing API
+-----------------------------------
+
+.. code-block:: c
+
+    struct page_frag page_frag, *pfrag;
+    bool merge = true;
+
+    pfrag = &page_frag;
+    if (!page_frag_refill_prepare(nc, 32U, pfrag, GFP_KERNEL))
+        goto wait_for_space;
+
+    copy = min_t(unsigned int, copy, pfrag->size);
+    if (!skb_can_coalesce(skb, i, pfrag->page, pfrag->offset)) {
+        if (i >= max_skb_frags)
+            goto new_segment;
+
+        merge = false;
+    }
+
+    copy = mem_schedule(copy);
+    if (!copy)
+        goto wait_for_space;
+
+    if (merge) {
+        skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+        page_frag_refill_commit_noref(nc, pfrag, copy);
+    } else {
+        skb_fill_page_desc(skb, i, pfrag->page, pfrag->offset, copy);
+        page_frag_refill_commit(nc, pfrag, copy);
+    }
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index c3347c97522c..1e699334646a 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -140,6 +140,116 @@ static inline void *page_frag_alloc(struct page_frag_cache *nc,
 	return __page_frag_alloc_align(nc, fragsz, gfp_mask, ~0u);
 }
 
+/**
+ * __page_frag_refill_prepare_align() - Prepare refilling a page_frag with
+ * aligning requirement.
+ * @nc: page_frag cache from which to refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ * @align_mask: the requested aligning requirement for the fragment
+ *
+ * Prepare refilling a page_frag from page_frag cache with aligning requirement.
+ *
+ * Return:
+ * True if prepare refilling succeeds, otherwise return false.
+ */
+static inline bool __page_frag_refill_prepare_align(struct page_frag_cache *nc,
+						    unsigned int fragsz,
+						    struct page_frag *pfrag,
+						    gfp_t gfp_mask,
+						    unsigned int align_mask)
+{
+	return !!__page_frag_cache_prepare(nc, fragsz, pfrag, gfp_mask,
+					   align_mask);
+}
+
+/**
+ * page_frag_refill_prepare_align() - Prepare refilling a page_frag with
+ * aligning requirement.
+ * @nc: page_frag cache from which to refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ * @align: the requested aligning requirement for the fragment
+ *
+ * Warn once if @align is not a power of 2, then prepare refilling a page_frag
+ * from the page_frag cache with the aligning requirement.
+ *
+ * Return:
+ * True if prepare refilling succeeds, otherwise return false.
+ */
+static inline bool page_frag_refill_prepare_align(struct page_frag_cache *nc,
+						  unsigned int fragsz,
+						  struct page_frag *pfrag,
+						  gfp_t gfp_mask,
+						  unsigned int align)
+{
+	WARN_ON_ONCE(!is_power_of_2(align));
+	return __page_frag_refill_prepare_align(nc, fragsz, pfrag, gfp_mask,
+						-align);
+}
+
+/**
+ * page_frag_refill_prepare() - Prepare refilling a page_frag.
+ * @nc: page_frag cache from which to refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ *
+ * Prepare refilling a page_frag from page_frag cache.
+ *
+ * Return:
+ * True if refill succeeds, otherwise return false.
+ */
+static inline bool page_frag_refill_prepare(struct page_frag_cache *nc,
+					    unsigned int fragsz,
+					    struct page_frag *pfrag,
+					    gfp_t gfp_mask)
+{
+	return __page_frag_refill_prepare_align(nc, fragsz, pfrag, gfp_mask,
+						~0u);
+}
+
+/**
+ * page_frag_refill_commit - Commit a prepare refilling.
+ * @nc: page_frag cache from which to commit
+ * @pfrag: the page_frag to be committed
+ * @used_sz: size of the page fragment that has been used
+ *
+ * Commit the actual used size for the refill that was prepared.
+ *
+ * Return:
+ * The true size of the fragment considering the offset alignment.
+ */
+static inline unsigned int page_frag_refill_commit(struct page_frag_cache *nc,
+						   struct page_frag *pfrag,
+						   unsigned int used_sz)
+{
+	return __page_frag_cache_commit(nc, pfrag, used_sz);
+}
+
+/**
+ * page_frag_refill_commit_noref - Commit a prepare refilling without taking
+ * refcount.
+ * @nc: page_frag cache from which to commit
+ * @pfrag: the page_frag to be committed
+ * @used_sz: size of the page fragment that has been used
+ *
+ * Commit the prepare refilling by passing the actual used size, but not taking
+ * refcount. Mostly used for fragment coalescing case when the current fragment
+ * can share the same refcount with previous fragment.
+ *
+ * Return:
+ * The true size of the fragment considering the offset alignment.
+ */
+static inline unsigned int
+page_frag_refill_commit_noref(struct page_frag_cache *nc,
+			      struct page_frag *pfrag, unsigned int used_sz)
+{
+	return __page_frag_cache_commit_noref(nc, pfrag, used_sz);
+}
+
 void page_frag_free(void *addr);
 void page_frag_alloc_abort_ref(struct page_frag_cache *nc, void *va,
 			       unsigned int fragsz);
-- 
2.33.0




* [PATCH net-next v2 06/10] mm: page_frag: introduce alloc_refill prepare & commit API
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (4 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 05/10] mm: page_frag: introduce refill prepare & commit API Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 07/10] mm: page_frag: introduce probe related API Yunsheng Lin
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Jonathan Corbet, linux-doc

Currently the alloc related API returns the virtual address of
the allocated fragment, and the refill related API returns the
page info of the allocated fragment through 'struct page_frag'.

There are use cases that need both the virtual address and
page info of the allocated fragment. Introduce alloc_refill
API for those use cases.

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 Documentation/mm/page_frags.rst | 45 +++++++++++++++++++++
 include/linux/page_frag_cache.h | 71 +++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+)

diff --git a/Documentation/mm/page_frags.rst b/Documentation/mm/page_frags.rst
index 4cfdbe7db55a..1c98f7090d92 100644
--- a/Documentation/mm/page_frags.rst
+++ b/Documentation/mm/page_frags.rst
@@ -111,6 +111,9 @@ page is aligned according to the 'align/alignment' parameter. Note the size of
 the allocated fragment is not aligned, the caller needs to provide an aligned
 fragsz if there is an alignment requirement for the size of the fragment.
 
+Depending on different use cases, callers expecting to deal with va, page or
+both va and page may call alloc, refill or alloc_refill API accordingly.
+
 There is a use case that needs minimum memory in order for forward progress, but
 more performant if more memory is available. By using the prepare and commit
 related API, the caller calls prepare API to requests the minimum memory it
@@ -123,6 +126,9 @@ uses, or not do so if deciding to not use any memory.
 		 __page_frag_alloc_align page_frag_alloc_align page_frag_alloc
 		 page_frag_alloc_abort __page_frag_refill_prepare_align
 		 page_frag_refill_prepare_align page_frag_refill_prepare
+		 __page_frag_alloc_refill_prepare_align
+		 page_frag_alloc_refill_prepare_align
+		 page_frag_alloc_refill_prepare
 
 .. kernel-doc:: mm/page_frag_cache.c
    :identifiers: page_frag_cache_drain page_frag_free page_frag_alloc_abort_ref
@@ -193,3 +199,42 @@ Refill Preparation & committing API
         skb_fill_page_desc(skb, i, pfrag->page, pfrag->offset, copy);
         page_frag_refill_commit(nc, pfrag, copy);
     }
+
+
+Alloc_Refill Preparation & committing API
+-----------------------------------------
+
+.. code-block:: c
+
+    struct page_frag page_frag, *pfrag;
+    bool merge = true;
+    void *va;
+
+    pfrag = &page_frag;
+    va = page_frag_alloc_refill_prepare(nc, 32U, pfrag, GFP_KERNEL);
+    if (!va)
+        goto wait_for_space;
+
+    copy = min_t(unsigned int, copy, pfrag->size);
+    if (!skb_can_coalesce(skb, i, pfrag->page, pfrag->offset)) {
+        if (i >= max_skb_frags)
+            goto new_segment;
+
+        merge = false;
+    }
+
+    copy = mem_schedule(copy);
+    if (!copy)
+        goto wait_for_space;
+
+    /* copy_from_iter_full_nocache() returns true on success */
+    if (!copy_from_iter_full_nocache(va, copy, iter))
+        goto do_error;
+
+    if (merge) {
+        skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+        page_frag_refill_commit_noref(nc, pfrag, copy);
+    } else {
+        skb_fill_page_desc(skb, i, pfrag->page, pfrag->offset, copy);
+        page_frag_refill_commit(nc, pfrag, copy);
+    }
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 1e699334646a..329390afbe78 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -211,6 +211,77 @@ static inline bool page_frag_refill_prepare(struct page_frag_cache *nc,
 						~0u);
 }
 
+/**
+ * __page_frag_alloc_refill_prepare_align() - Prepare allocating a fragment and
+ * refilling a page_frag with aligning requirement.
+ * @nc: page_frag cache from which to allocate and refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ * @align_mask: the requested aligning requirement for the fragment.
+ *
+ * Prepare allocating a fragment and refilling a page_frag from page_frag cache.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
+static inline void
+*__page_frag_alloc_refill_prepare_align(struct page_frag_cache *nc,
+					unsigned int fragsz,
+					struct page_frag *pfrag,
+					gfp_t gfp_mask, unsigned int align_mask)
+{
+	return __page_frag_cache_prepare(nc, fragsz, pfrag, gfp_mask, align_mask);
+}
+
+/**
+ * page_frag_alloc_refill_prepare_align() - Prepare allocating a fragment and
+ * refilling a page_frag with aligning requirement.
+ * @nc: page_frag cache from which to allocate and refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ * @align: the requested aligning requirement for the fragment.
+ *
+ * Warn once if @align is not a power of 2, then prepare allocating a fragment
+ * and refilling a page_frag from the page_frag cache.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
+static inline void
+*page_frag_alloc_refill_prepare_align(struct page_frag_cache *nc,
+				      unsigned int fragsz,
+				      struct page_frag *pfrag, gfp_t gfp_mask,
+				      unsigned int align)
+{
+	WARN_ON_ONCE(!is_power_of_2(align));
+	return __page_frag_alloc_refill_prepare_align(nc, fragsz, pfrag,
+						      gfp_mask, -align);
+}
+
+/**
+ * page_frag_alloc_refill_prepare() - Prepare allocating a fragment and
+ * refilling a page_frag.
+ * @nc: page_frag cache from which to allocate and refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @gfp_mask: the allocation gfp to use when cache needs to be refilled
+ *
+ * Prepare allocating a fragment and refilling a page_frag from page_frag cache.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
+static inline void *page_frag_alloc_refill_prepare(struct page_frag_cache *nc,
+						   unsigned int fragsz,
+						   struct page_frag *pfrag,
+						   gfp_t gfp_mask)
+{
+	return __page_frag_alloc_refill_prepare_align(nc, fragsz, pfrag,
+						      gfp_mask, ~0u);
+}
+
 /**
  * page_frag_refill_commit - Commit a prepare refilling.
  * @nc: page_frag cache from which to commit
-- 
2.33.0




* [PATCH net-next v2 07/10] mm: page_frag: introduce probe related API
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (5 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 06/10] mm: page_frag: introduce alloc_refill " Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 08/10] mm: page_frag: add testing for the newly added API Yunsheng Lin
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Jonathan Corbet, linux-doc

Some use cases may need a bigger fragment when the current
fragment can't be coalesced with the previous fragment, because
a new fragment may need extra space for some header. So
introduce a probe related API to tell whether the cache has the
minimum remaining memory needed to coalesce with the previous
fragment, in order to save as much memory as possible.
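
The probe semantics can be modelled in userspace roughly as below.
This is a sketch under assumptions rather than the kernel code:
struct probe_cache_model, struct probe_frag_model and
probe_model_probe() are hypothetical names, and the cached page has a
fixed size. The point it tries to show is that a probe only inspects
what is left in the cache: it takes no gfp mask, never refills, and
leaves the cache state unchanged.

```c
#include <assert.h>
#include <stddef.h>

#define PROBE_MODEL_SIZE 4096u

/* Hypothetical stand-in for 'struct page_frag_cache'. */
struct probe_cache_model {
	unsigned char *base;	/* start of the cached page, NULL if empty */
	unsigned int offset;	/* next free byte in the cached page */
};

/* Hypothetical stand-in for the offset/size part of 'struct page_frag'. */
struct probe_frag_model {
	unsigned int offset;
	unsigned int size;
};

/*
 * align_mask follows the convention used in this series: -align for an
 * alignment of 'align' bytes, ~0u for no alignment requirement.
 */
static void *probe_model_probe(const struct probe_cache_model *nc,
			       unsigned int fragsz,
			       struct probe_frag_model *pfrag,
			       unsigned int align_mask)
{
	/* Round the current offset up to the requested alignment. */
	unsigned int offset = (nc->offset + ~align_mask) & align_mask;

	/* No cached page, or not enough room left: the probe fails. */
	if (!nc->base || offset + fragsz > PROBE_MODEL_SIZE)
		return NULL;

	pfrag->offset = offset;
	pfrag->size = PROBE_MODEL_SIZE - offset;
	return nc->base + offset;
}
```

A caller could probe with the small size that is enough for
coalescing, and fall back to a prepare call with the bigger size only
when the probe fails or the fragment cannot be coalesced.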

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 Documentation/mm/page_frags.rst | 10 +++++++-
 include/linux/page_frag_cache.h | 41 +++++++++++++++++++++++++++++++++
 mm/page_frag_cache.c            | 35 ++++++++++++++++++++++++++++
 3 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/Documentation/mm/page_frags.rst b/Documentation/mm/page_frags.rst
index 1c98f7090d92..3e34831a0029 100644
--- a/Documentation/mm/page_frags.rst
+++ b/Documentation/mm/page_frags.rst
@@ -119,7 +119,13 @@ more performant if more memory is available. By using the prepare and commit
 related API, the caller calls prepare API to requests the minimum memory it
 needs and prepare API will return the maximum size of the fragment returned. The
 caller needs to either call the commit API to report how much memory it actually
-uses, or not do so if deciding to not use any memory.
+uses, or not do so if deciding to not use any memory. Some use cases may need a
+bigger fragment when the current fragment can't be coalesced with the previous
+fragment, because a new fragment may need extra space for some header. The
+probe related API can be used to tell whether the cache has the minimum
+remaining memory needed to coalesce with the previous fragment, in order to
+save as much memory as possible.
+
 
 .. kernel-doc:: include/linux/page_frag_cache.h
    :identifiers: page_frag_cache_init page_frag_cache_is_pfmemalloc
@@ -129,9 +135,11 @@ uses, or not do so if deciding to not use any memory.
 		 __page_frag_alloc_refill_prepare_align
 		 page_frag_alloc_refill_prepare_align
 		 page_frag_alloc_refill_prepare
+                 page_frag_alloc_refill_probe page_frag_refill_probe
 
 .. kernel-doc:: mm/page_frag_cache.c
    :identifiers: page_frag_cache_drain page_frag_free page_frag_alloc_abort_ref
+                 __page_frag_alloc_refill_probe_align
 
 Coding examples
 ===============
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 329390afbe78..0f7e8da91a67 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -63,6 +63,10 @@ void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
 unsigned int __page_frag_cache_commit_noref(struct page_frag_cache *nc,
 					    struct page_frag *pfrag,
 					    unsigned int used_sz);
+void *__page_frag_alloc_refill_probe_align(struct page_frag_cache *nc,
+					   unsigned int fragsz,
+					   struct page_frag *pfrag,
+					   unsigned int align_mask);
 
 static inline unsigned int __page_frag_cache_commit(struct page_frag_cache *nc,
 						    struct page_frag *pfrag,
@@ -282,6 +286,43 @@ static inline void *page_frag_alloc_refill_prepare(struct page_frag_cache *nc,
 						      gfp_mask, ~0u);
 }
 
+/**
+ * page_frag_alloc_refill_probe() - Probe allocating a fragment and refilling
+ * a page_frag.
+ * @nc: page_frag cache from which to allocate and refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled
+ *
+ * Probe allocating a fragment and refilling a page_frag from page_frag cache.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
+static inline void *page_frag_alloc_refill_probe(struct page_frag_cache *nc,
+						 unsigned int fragsz,
+						 struct page_frag *pfrag)
+{
+	return __page_frag_alloc_refill_probe_align(nc, fragsz, pfrag, ~0u);
+}
+
+/**
+ * page_frag_refill_probe() - Probe refilling a page_frag.
+ * @nc: page_frag cache from which to refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled
+ *
+ * Probe refilling a page_frag from page_frag cache.
+ *
+ * Return:
+ * True if refill succeeds, otherwise return false.
+ */
+static inline bool page_frag_refill_probe(struct page_frag_cache *nc,
+					  unsigned int fragsz,
+					  struct page_frag *pfrag)
+{
+	return !!page_frag_alloc_refill_probe(nc, fragsz, pfrag);
+}
+
 /**
  * page_frag_refill_commit - Commit a prepare refilling.
  * @nc: page_frag cache from which to commit
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 8c3cfdbe8c2b..ae40520d452a 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -116,6 +116,41 @@ unsigned int __page_frag_cache_commit_noref(struct page_frag_cache *nc,
 }
 EXPORT_SYMBOL(__page_frag_cache_commit_noref);
 
+/**
+ * __page_frag_alloc_refill_probe_align() - Probe allocating a fragment and
+ * refilling a page_frag with aligning requirement.
+ * @nc: page_frag cache from which to allocate and refill
+ * @fragsz: the requested fragment size
+ * @pfrag: the page_frag to be refilled.
+ * @align_mask: the requested aligning requirement for the fragment.
+ *
+ * Probe allocating a fragment and refilling a page_frag from page_frag cache
+ * with aligning requirement.
+ *
+ * Return:
+ * virtual address of the page fragment, otherwise return NULL.
+ */
+void *__page_frag_alloc_refill_probe_align(struct page_frag_cache *nc,
+					   unsigned int fragsz,
+					   struct page_frag *pfrag,
+					   unsigned int align_mask)
+{
+	unsigned long encoded_page = nc->encoded_page;
+	unsigned int size, offset;
+
+	size = PAGE_SIZE << encoded_page_decode_order(encoded_page);
+	offset = __ALIGN_KERNEL_MASK(nc->offset, ~align_mask);
+	if (unlikely(!encoded_page || offset + fragsz > size))
+		return NULL;
+
+	pfrag->page = encoded_page_decode_page(encoded_page);
+	pfrag->size = size - offset;
+	pfrag->offset = offset;
+
+	return encoded_page_decode_virt(encoded_page) + offset;
+}
+EXPORT_SYMBOL(__page_frag_alloc_refill_probe_align);
+
 void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
 				struct page_frag *pfrag, gfp_t gfp_mask,
 				unsigned int align_mask)
-- 
2.33.0




* [PATCH net-next v2 08/10] mm: page_frag: add testing for the newly added API
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (6 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 07/10] mm: page_frag: introduce probe related API Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 09/10] net: replace page_frag with page_frag_cache Yunsheng Lin
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Shuah Khan, linux-kselftest

Add testing for the newly added prepare API, covering both the
aligned and non-aligned variants. The probe API is tested along
with the prepare API.

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 .../selftests/mm/page_frag/page_frag_test.c   | 76 +++++++++++++++++--
 tools/testing/selftests/mm/run_vmtests.sh     |  4 +
 tools/testing/selftests/mm/test_page_frag.sh  | 27 +++++++
 3 files changed, 102 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index e806c1866e36..3b3c32389def 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -32,6 +32,10 @@ static bool test_align;
 module_param(test_align, bool, 0);
 MODULE_PARM_DESC(test_align, "use align API for testing");
 
+static bool test_prepare;
+module_param(test_prepare, bool, 0);
+MODULE_PARM_DESC(test_prepare, "use prepare API for testing");
+
 static int test_alloc_len = 2048;
 module_param(test_alloc_len, int, 0);
 MODULE_PARM_DESC(test_alloc_len, "alloc len for testing");
@@ -74,6 +78,21 @@ static int page_frag_pop_thread(void *arg)
 	return 0;
 }
 
+static void frag_frag_test_commit(struct page_frag_cache *nc,
+				  struct page_frag *prepare_pfrag,
+				  struct page_frag *probe_pfrag,
+				  unsigned int used_sz)
+{
+	if (prepare_pfrag->page != probe_pfrag->page ||
+	    prepare_pfrag->offset != probe_pfrag->offset ||
+	    prepare_pfrag->size != probe_pfrag->size) {
+		force_exit = true;
+		WARN_ONCE(true, TEST_FAILED_PREFIX "wrong probed info\n");
+	}
+
+	page_frag_refill_commit(nc, prepare_pfrag, used_sz);
+}
+
 static int page_frag_push_thread(void *arg)
 {
 	struct ptr_ring *ring = arg;
@@ -86,15 +105,61 @@ static int page_frag_push_thread(void *arg)
 		int ret;
 
 		if (test_align) {
-			va = page_frag_alloc_align(&test_nc, test_alloc_len,
-						   GFP_KERNEL, SMP_CACHE_BYTES);
+			if (test_prepare) {
+				struct page_frag prepare_frag, probe_frag;
+				void *probe_va;
+
+				va = page_frag_alloc_refill_prepare_align(&test_nc,
+									  test_alloc_len,
+									  &prepare_frag,
+									  GFP_KERNEL,
+									  SMP_CACHE_BYTES);
+
+				probe_va = __page_frag_alloc_refill_probe_align(&test_nc,
+										test_alloc_len,
+										&probe_frag,
+										-SMP_CACHE_BYTES);
+				if (va != probe_va) {
+					force_exit = true;
+					WARN_ONCE(true, TEST_FAILED_PREFIX "wrong va\n");
+				}
+
+				if (likely(va))
+					frag_frag_test_commit(&test_nc, &prepare_frag,
+							      &probe_frag, test_alloc_len);
+			} else {
+				va = page_frag_alloc_align(&test_nc,
+							   test_alloc_len,
+							   GFP_KERNEL,
+							   SMP_CACHE_BYTES);
+			}
 
 			if ((unsigned long)va & (SMP_CACHE_BYTES - 1)) {
 				force_exit = true;
 				WARN_ONCE(true, TEST_FAILED_PREFIX "unaligned va returned\n");
 			}
 		} else {
-			va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL);
+			if (test_prepare) {
+				struct page_frag prepare_frag, probe_frag;
+				void *probe_va;
+
+				va = page_frag_alloc_refill_prepare(&test_nc, test_alloc_len,
+								    &prepare_frag, GFP_KERNEL);
+
+				probe_va = page_frag_alloc_refill_probe(&test_nc, test_alloc_len,
+									&probe_frag);
+
+				if (va != probe_va) {
+					force_exit = true;
+					WARN_ONCE(true, TEST_FAILED_PREFIX "wrong va\n");
+				}
+
+				if (likely(va))
+					frag_frag_test_commit(&test_nc, &prepare_frag,
+							      &probe_frag, test_alloc_len);
+			} else {
+				va = page_frag_alloc(&test_nc, test_alloc_len, GFP_KERNEL);
+			}
 		}
 
 		if (!va)
@@ -176,8 +241,9 @@ static int __init page_frag_test_init(void)
 	}
 
 	duration = (u64)ktime_us_delta(ktime_get(), start);
-	pr_info("%d of iterations for %s testing took: %lluus\n", nr_test,
-		test_align ? "aligned" : "non-aligned", duration);
+	pr_info("%d of iterations for %s %s API testing took: %lluus\n", nr_test,
+		test_align ? "aligned" : "non-aligned",
+		test_prepare ? "prepare" : "alloc", duration);
 
 out:
 	ptr_ring_cleanup(&ptr_ring, NULL);
diff --git a/tools/testing/selftests/mm/run_vmtests.sh b/tools/testing/selftests/mm/run_vmtests.sh
index 2fc290d9430c..881c17803baf 100755
--- a/tools/testing/selftests/mm/run_vmtests.sh
+++ b/tools/testing/selftests/mm/run_vmtests.sh
@@ -466,6 +466,10 @@ CATEGORY="page_frag" run_test ./test_page_frag.sh aligned
 
 CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned
 
+CATEGORY="page_frag" run_test ./test_page_frag.sh aligned_prepare
+
+CATEGORY="page_frag" run_test ./test_page_frag.sh nonaligned_prepare
+
 echo "SUMMARY: PASS=${count_pass} SKIP=${count_skip} FAIL=${count_fail}" | tap_prefix
 echo "1..${count_total}" | tap_output
 
diff --git a/tools/testing/selftests/mm/test_page_frag.sh b/tools/testing/selftests/mm/test_page_frag.sh
index f55b105084cf..1c757fd11844 100755
--- a/tools/testing/selftests/mm/test_page_frag.sh
+++ b/tools/testing/selftests/mm/test_page_frag.sh
@@ -43,6 +43,8 @@ check_test_failed_prefix() {
 SMOKE_PARAM="test_push_cpu=$TEST_CPU_0 test_pop_cpu=$TEST_CPU_1"
 NONALIGNED_PARAM="$SMOKE_PARAM test_alloc_len=75 nr_test=$NR_TEST"
 ALIGNED_PARAM="$NONALIGNED_PARAM test_align=1"
+NONALIGNED_PREPARE_PARAM="$NONALIGNED_PARAM test_prepare=1"
+ALIGNED_PREPARE_PARAM="$ALIGNED_PARAM test_prepare=1"
 
 check_test_requirements()
 {
@@ -77,6 +79,20 @@ run_aligned_check()
 	insmod $DRIVER $ALIGNED_PARAM > /dev/null 2>&1
 }
 
+run_nonaligned_prepare_check()
+{
+	echo "Run performance tests to evaluate how fast nonaligned prepare API is."
+
+	insmod $DRIVER $NONALIGNED_PREPARE_PARAM > /dev/null 2>&1
+}
+
+run_aligned_prepare_check()
+{
+	echo "Run performance tests to evaluate how fast aligned prepare API is."
+
+	insmod $DRIVER $ALIGNED_PREPARE_PARAM > /dev/null 2>&1
+}
+
 run_smoke_check()
 {
 	echo "Run smoke test."
@@ -87,6 +103,7 @@ run_smoke_check()
 usage()
 {
 	echo -n "Usage: $0 [ aligned ] | [ nonaligned ] | | [ smoke ] | "
+	echo "[ aligned_prepare ] | [ nonaligned_prepare ] | "
 	echo "manual parameters"
 	echo
 	echo "Valid tests and parameters:"
@@ -107,6 +124,12 @@ usage()
 	echo "# Performance testing for aligned alloc API"
 	echo "$0 aligned"
 	echo
+	echo "# Performance testing for nonaligned prepare API"
+	echo "$0 nonaligned_prepare"
+	echo
+	echo "# Performance testing for aligned prepare API"
+	echo "$0 aligned_prepare"
+	echo
 	exit 0
 }
 
@@ -158,6 +181,10 @@ function run_test()
 			run_nonaligned_check
 		elif [[ "$1" = "aligned" ]]; then
 			run_aligned_check
+		elif [[ "$1" = "nonaligned_prepare" ]]; then
+			run_nonaligned_prepare_check
+		elif [[ "$1" = "aligned_prepare" ]]; then
+			run_aligned_prepare_check
 		else
 			run_manual_check $@
 		fi
-- 
2.33.0




* [PATCH net-next v2 09/10] net: replace page_frag with page_frag_cache
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (7 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 08/10] mm: page_frag: add testing for the newly added API Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-06 12:25 ` [PATCH net-next v2 10/10] mm: page_frag: add an entry in MAINTAINERS for page_frag Yunsheng Lin
  2024-12-08 21:34 ` [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Alexander Duyck
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM, Ayush Sawal, Andrew Lunn, Eric Dumazet,
	Willem de Bruijn, Jason Wang, Ingo Molnar, Peter Zijlstra,
	Juri Lelli, Vincent Guittot, Dietmar Eggemann, Steven Rostedt,
	Ben Segall, Mel Gorman, Valentin Schneider, Simon Horman,
	John Fastabend, Jakub Sitnicki, David Ahern, Matthieu Baerts,
	Mat Martineau, Geliang Tang, Boris Pismenny, bpf, mptcp

Use the newly introduced prepare/probe/commit API to
replace page_frag with page_frag_cache for sk_page_frag().

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 .../chelsio/inline_crypto/chtls/chtls.h       |   3 -
 .../chelsio/inline_crypto/chtls/chtls_io.c    | 101 +++++-------------
 .../chelsio/inline_crypto/chtls/chtls_main.c  |   3 -
 drivers/net/tun.c                             |  47 ++++----
 include/linux/sched.h                         |   2 +-
 include/net/sock.h                            |  21 ++--
 kernel/exit.c                                 |   3 +-
 kernel/fork.c                                 |   3 +-
 net/core/skbuff.c                             |  58 +++++-----
 net/core/skmsg.c                              |  12 ++-
 net/core/sock.c                               |  32 ++++--
 net/ipv4/ip_output.c                          |  28 +++--
 net/ipv4/tcp.c                                |  23 ++--
 net/ipv4/tcp_output.c                         |  25 +++--
 net/ipv6/ip6_output.c                         |  28 +++--
 net/kcm/kcmsock.c                             |  18 ++--
 net/mptcp/protocol.c                          |  47 ++++----
 net/tls/tls_device.c                          | 100 ++++++++++-------
 18 files changed, 293 insertions(+), 261 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
index 21e0dfeff158..85ce0b2f1f3f 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls.h
@@ -234,7 +234,6 @@ struct chtls_dev {
 	struct list_head list_node;
 	struct list_head rcu_node;
 	struct list_head na_node;
-	unsigned int send_page_order;
 	int max_host_sndbuf;
 	u32 round_robin_cnt;
 	struct key_map kmap;
@@ -453,8 +452,6 @@ enum {
 
 /* The ULP mode/submode of an skbuff */
 #define skb_ulp_mode(skb)  (ULP_SKB_CB(skb)->ulp_mode)
-#define TCP_PAGE(sk)   (sk->sk_frag.page)
-#define TCP_OFF(sk)    (sk->sk_frag.offset)
 
 static inline struct chtls_dev *to_chtls_dev(struct tls_toe_device *tlsdev)
 {
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
index d567e42e1760..7b1760ab55ba 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_io.c
@@ -825,12 +825,6 @@ void skb_entail(struct sock *sk, struct sk_buff *skb, int flags)
 	ULP_SKB_CB(skb)->flags = flags;
 	__skb_queue_tail(&csk->txq, skb);
 	sk->sk_wmem_queued += skb->truesize;
-
-	if (TCP_PAGE(sk) && TCP_OFF(sk)) {
-		put_page(TCP_PAGE(sk));
-		TCP_PAGE(sk) = NULL;
-		TCP_OFF(sk) = 0;
-	}
 }
 
 static struct sk_buff *get_tx_skb(struct sock *sk, int size)
@@ -882,16 +876,12 @@ static void push_frames_if_head(struct sock *sk)
 		chtls_push_frames(csk, 1);
 }
 
-static int chtls_skb_copy_to_page_nocache(struct sock *sk,
-					  struct iov_iter *from,
-					  struct sk_buff *skb,
-					  struct page *page,
-					  int off, int copy)
+static int chtls_skb_copy_to_va_nocache(struct sock *sk, struct iov_iter *from,
+					struct sk_buff *skb, char *va, int copy)
 {
 	int err;
 
-	err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) +
-				       off, copy, skb->len);
+	err = skb_do_copy_data_nocache(sk, skb, from, va, copy, skb->len);
 	if (err)
 		return err;
 
@@ -1114,82 +1104,45 @@ int chtls_sendmsg(struct sock *sk, struct msghdr *msg, size_t size)
 			if (err)
 				goto do_fault;
 		} else {
+			struct page_frag_cache *nc = &sk->sk_frag;
+			struct page_frag page_frag, *pfrag;
 			int i = skb_shinfo(skb)->nr_frags;
-			struct page *page = TCP_PAGE(sk);
-			int pg_size = PAGE_SIZE;
-			int off = TCP_OFF(sk);
-			bool merge;
-
-			if (page)
-				pg_size = page_size(page);
-			if (off < pg_size &&
-			    skb_can_coalesce(skb, i, page, off)) {
+			bool merge = false;
+			void *va;
+
+			pfrag = &page_frag;
+			va = page_frag_alloc_refill_prepare(nc, 32U, pfrag,
+							    sk->sk_allocation);
+			if (unlikely(!va))
+				goto wait_for_memory;
+
+			if (skb_can_coalesce(skb, i, pfrag->page,
+					     pfrag->offset))
 				merge = true;
-				goto copy;
-			}
-			merge = false;
-			if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
-			    MAX_SKB_FRAGS))
+			else if (i == (is_tls_tx(csk) ? (MAX_SKB_FRAGS - 1) :
+				       MAX_SKB_FRAGS))
 				goto new_buf;
 
-			if (page && off == pg_size) {
-				put_page(page);
-				TCP_PAGE(sk) = page = NULL;
-				pg_size = PAGE_SIZE;
-			}
-
-			if (!page) {
-				gfp_t gfp = sk->sk_allocation;
-				int order = cdev->send_page_order;
-
-				if (order) {
-					page = alloc_pages(gfp | __GFP_COMP |
-							   __GFP_NOWARN |
-							   __GFP_NORETRY,
-							   order);
-					if (page)
-						pg_size <<= order;
-				}
-				if (!page) {
-					page = alloc_page(gfp);
-					pg_size = PAGE_SIZE;
-				}
-				if (!page)
-					goto wait_for_memory;
-				off = 0;
-			}
-copy:
-			if (copy > pg_size - off)
-				copy = pg_size - off;
+			copy = min_t(int, copy, pfrag->size);
 			if (is_tls_tx(csk))
 				copy = min_t(int, copy, csk->tlshws.txleft);
 
-			err = chtls_skb_copy_to_page_nocache(sk, &msg->msg_iter,
-							     skb, page,
-							     off, copy);
-			if (unlikely(err)) {
-				if (!TCP_PAGE(sk)) {
-					TCP_PAGE(sk) = page;
-					TCP_OFF(sk) = 0;
-				}
+			err = chtls_skb_copy_to_va_nocache(sk, &msg->msg_iter,
+							   skb, va, copy);
+			if (unlikely(err))
 				goto do_fault;
-			}
+
 			/* Update the skb. */
 			if (merge) {
 				skb_frag_size_add(
 						&skb_shinfo(skb)->frags[i - 1],
 						copy);
+				page_frag_refill_commit_noref(nc, pfrag, copy);
 			} else {
-				skb_fill_page_desc(skb, i, page, off, copy);
-				if (off + copy < pg_size) {
-					/* space left keep page */
-					get_page(page);
-					TCP_PAGE(sk) = page;
-				} else {
-					TCP_PAGE(sk) = NULL;
-				}
+				skb_fill_page_desc(skb, i, pfrag->page,
+						   pfrag->offset, copy);
+				page_frag_refill_commit(nc, pfrag, copy);
 			}
-			TCP_OFF(sk) = off + copy;
 		}
 		if (unlikely(skb->len == mss))
 			tx_skb_finalize(skb);
diff --git a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
index 96fd31d75dfd..7284269174c5 100644
--- a/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
+++ b/drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_main.c
@@ -34,7 +34,6 @@ static DEFINE_MUTEX(notify_mutex);
 static RAW_NOTIFIER_HEAD(listen_notify_list);
 static struct proto chtls_cpl_prot, chtls_cpl_protv6;
 struct request_sock_ops chtls_rsk_ops, chtls_rsk_opsv6;
-static uint send_page_order = (14 - PAGE_SHIFT < 0) ? 0 : 14 - PAGE_SHIFT;
 
 static void register_listen_notifier(struct notifier_block *nb)
 {
@@ -273,8 +272,6 @@ static void *chtls_uld_add(const struct cxgb4_lld_info *info)
 	INIT_WORK(&cdev->deferq_task, process_deferq);
 	spin_lock_init(&cdev->listen_lock);
 	spin_lock_init(&cdev->idr_lock);
-	cdev->send_page_order = min_t(uint, get_order(32768),
-				      send_page_order);
 	cdev->max_host_sndbuf = 48 * 1024;
 
 	if (lldi->vr->key.size)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index d7a865ef370b..4ca6590ef5fe 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1599,21 +1599,19 @@ static bool tun_can_build_skb(struct tun_struct *tun, struct tun_file *tfile,
 }
 
 static struct sk_buff *__tun_build_skb(struct tun_file *tfile,
-				       struct page_frag *alloc_frag, char *buf,
-				       int buflen, int len, int pad)
+				       char *buf, int buflen, int len, int pad)
 {
 	struct sk_buff *skb = build_skb(buf, buflen);
 
-	if (!skb)
+	if (!skb) {
+		page_frag_free(buf);
 		return ERR_PTR(-ENOMEM);
+	}
 
 	skb_reserve(skb, pad);
 	skb_put(skb, len);
 	skb_set_owner_w(skb, tfile->socket.sk);
 
-	get_page(alloc_frag->page);
-	alloc_frag->offset += buflen;
-
 	return skb;
 }
 
@@ -1661,8 +1659,8 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 				     struct virtio_net_hdr *hdr,
 				     int len, int *skb_xdp)
 {
-	struct page_frag *alloc_frag = &current->task_frag;
 	struct bpf_net_context __bpf_net_ctx, *bpf_net_ctx;
+	struct page_frag_cache *nc = &current->task_frag;
 	struct bpf_prog *xdp_prog;
 	int buflen = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 	char *buf;
@@ -1677,16 +1675,16 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	buflen += SKB_DATA_ALIGN(len + pad);
 	rcu_read_unlock();
 
-	alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
-	if (unlikely(!skb_page_frag_refill(buflen, alloc_frag, GFP_KERNEL)))
+	buf = page_frag_alloc_align(nc, buflen, GFP_KERNEL,
+				    SMP_CACHE_BYTES);
+	if (unlikely(!buf))
 		return ERR_PTR(-ENOMEM);
 
-	buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
-	copied = copy_page_from_iter(alloc_frag->page,
-				     alloc_frag->offset + pad,
-				     len, from);
-	if (copied != len)
+	copied = copy_from_iter(buf + pad, len, from);
+	if (copied != len) {
+		page_frag_alloc_abort(nc, buf, buflen);
 		return ERR_PTR(-EFAULT);
+	}
 
 	/* There's a small window that XDP may be set after the check
 	 * of xdp_prog above, this should be rare and for simplicity
@@ -1694,8 +1692,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	 */
 	if (hdr->gso_type || !xdp_prog) {
 		*skb_xdp = 1;
-		return __tun_build_skb(tfile, alloc_frag, buf, buflen, len,
-				       pad);
+		return __tun_build_skb(tfile, buf, buflen, len, pad);
 	}
 
 	*skb_xdp = 0;
@@ -1712,21 +1709,23 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		xdp_prepare_buff(&xdp, buf, pad, len, false);
 
 		act = bpf_prog_run_xdp(xdp_prog, &xdp);
-		if (act == XDP_REDIRECT || act == XDP_TX) {
-			get_page(alloc_frag->page);
-			alloc_frag->offset += buflen;
-		}
 		err = tun_xdp_act(tun, xdp_prog, &xdp, act);
 		if (err < 0) {
-			if (act == XDP_REDIRECT || act == XDP_TX)
-				put_page(alloc_frag->page);
+			if (act == XDP_REDIRECT || act == XDP_TX) {
+				page_frag_alloc_abort_ref(nc, buf, buflen);
+				goto out;
+			}
+
+			page_frag_alloc_abort(nc, buf, buflen);
 			goto out;
 		}
 
 		if (err == XDP_REDIRECT)
 			xdp_do_flush();
-		if (err != XDP_PASS)
+		if (err != XDP_PASS) {
+			page_frag_alloc_abort(nc, buf, buflen);
 			goto out;
+		}
 
 		pad = xdp.data - xdp.data_hard_start;
 		len = xdp.data_end - xdp.data;
@@ -1735,7 +1734,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 	rcu_read_unlock();
 	local_bh_enable();
 
-	return __tun_build_skb(tfile, alloc_frag, buf, buflen, len, pad);
+	return __tun_build_skb(tfile, buf, buflen, len, pad);
 
 out:
 	bpf_net_ctx_clear(bpf_net_ctx);
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d380bffee2ef..73c425bac58d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1382,7 +1382,7 @@ struct task_struct {
 	/* Cache last used pipe for splice(): */
 	struct pipe_inode_info		*splice_pipe;
 
-	struct page_frag		task_frag;
+	struct page_frag_cache		task_frag;
 
 #ifdef CONFIG_TASK_DELAY_ACCT
 	struct task_delay_info		*delays;
diff --git a/include/net/sock.h b/include/net/sock.h
index cf037c870e3b..9b24f53c29e7 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -303,7 +303,7 @@ struct sk_filter;
   *	@sk_stamp: time stamp of last packet received
   *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
   *	@sk_tsflags: SO_TIMESTAMPING flags
-  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
+  *	@sk_use_task_frag: allow sk_page_frag_cache() to use current->task_frag.
   *			   Sockets that can be used under memory reclaim should
   *			   set this to false.
   *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
@@ -462,7 +462,7 @@ struct sock {
 	struct sk_buff_head	sk_write_queue;
 	u32			sk_dst_pending_confirm;
 	u32			sk_pacing_status; /* see enum sk_pacing */
-	struct page_frag	sk_frag;
+	struct page_frag_cache	sk_frag;
 	struct timer_list	sk_timer;
 
 	unsigned long		sk_pacing_rate; /* bytes per second */
@@ -2491,22 +2491,22 @@ static inline void sk_stream_moderate_sndbuf(struct sock *sk)
 }
 
 /**
- * sk_page_frag - return an appropriate page_frag
+ * sk_page_frag_cache - return an appropriate page_frag_cache
  * @sk: socket
  *
- * Use the per task page_frag instead of the per socket one for
+ * Use the per task page_frag_cache instead of the per socket one for
  * optimization when we know that we're in process context and own
  * everything that's associated with %current.
  *
  * Both direct reclaim and page faults can nest inside other
- * socket operations and end up recursing into sk_page_frag()
- * while it's already in use: explicitly avoid task page_frag
+ * socket operations and end up recursing into sk_page_frag_cache()
+ * while it's already in use: explicitly avoid task page_frag_cache
  * when users disable sk_use_task_frag.
  *
  * Return: a per task page_frag if context allows that,
  * otherwise a per socket one.
  */
-static inline struct page_frag *sk_page_frag(struct sock *sk)
+static inline struct page_frag_cache *sk_page_frag_cache(struct sock *sk)
 {
 	if (sk->sk_use_task_frag)
 		return &current->task_frag;
@@ -2514,7 +2514,12 @@ static inline struct page_frag *sk_page_frag(struct sock *sk)
 	return &sk->sk_frag;
 }
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
+bool sk_page_frag_refill_prepare(struct sock *sk, struct page_frag_cache *nc,
+				 struct page_frag *pfrag);
+
+void *sk_page_frag_alloc_refill_prepare(struct sock *sk,
+					struct page_frag_cache *nc,
+					struct page_frag *pfrag);
 
 /*
  *	Default write policy as shown to user space via poll/select/SIGIO
diff --git a/kernel/exit.c b/kernel/exit.c
index 1dcddfe537ee..010dc4a05dc5 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -973,8 +973,7 @@ void __noreturn do_exit(long code)
 	if (tsk->splice_pipe)
 		free_pipe_info(tsk->splice_pipe);
 
-	if (tsk->task_frag.page)
-		put_page(tsk->task_frag.page);
+	page_frag_cache_drain(&tsk->task_frag);
 
 	exit_task_stack_account(tsk);
 
diff --git a/kernel/fork.c b/kernel/fork.c
index 1450b461d196..a0f7b2d9ce05 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -80,6 +80,7 @@
 #include <linux/tty.h>
 #include <linux/fs_struct.h>
 #include <linux/magic.h>
+#include <linux/page_frag_cache.h>
 #include <linux/perf_event.h>
 #include <linux/posix-timers.h>
 #include <linux/user-return-notifier.h>
@@ -1165,10 +1166,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
 	tsk->btrace_seq = 0;
 #endif
 	tsk->splice_pipe = NULL;
-	tsk->task_frag.page = NULL;
 	tsk->wake_q.next = NULL;
 	tsk->worker_private = NULL;
 
+	page_frag_cache_init(&tsk->task_frag);
 	kcov_task_init(tsk);
 	kmsan_task_create(tsk);
 	kmap_local_fork(tsk);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 6841e61a6bd0..684cd68ca4ab 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -3062,25 +3062,6 @@ static void sock_spd_release(struct splice_pipe_desc *spd, unsigned int i)
 	put_page(spd->pages[i]);
 }
 
-static struct page *linear_to_page(struct page *page, unsigned int *len,
-				   unsigned int *offset,
-				   struct sock *sk)
-{
-	struct page_frag *pfrag = sk_page_frag(sk);
-
-	if (!sk_page_frag_refill(sk, pfrag))
-		return NULL;
-
-	*len = min_t(unsigned int, *len, pfrag->size - pfrag->offset);
-
-	memcpy(page_address(pfrag->page) + pfrag->offset,
-	       page_address(page) + *offset, *len);
-	*offset = pfrag->offset;
-	pfrag->offset += *len;
-
-	return pfrag->page;
-}
-
 static bool spd_can_coalesce(const struct splice_pipe_desc *spd,
 			     struct page *page,
 			     unsigned int offset)
@@ -3091,6 +3072,37 @@ static bool spd_can_coalesce(const struct splice_pipe_desc *spd,
 		 spd->partial[spd->nr_pages - 1].len == offset);
 }
 
+static bool spd_fill_linear_page(struct splice_pipe_desc *spd,
+				 struct page *page, unsigned int offset,
+				 unsigned int *len, struct sock *sk)
+{
+	struct page_frag_cache *nc = sk_page_frag_cache(sk);
+	struct page_frag page_frag, *pfrag;
+	void *va;
+
+	pfrag = &page_frag;
+	va = sk_page_frag_alloc_refill_prepare(sk, nc, pfrag);
+	if (!va)
+		return true;
+
+	*len = min_t(unsigned int, *len, pfrag->size);
+	memcpy(va, page_address(page) + offset, *len);
+
+	if (spd_can_coalesce(spd, pfrag->page, pfrag->offset)) {
+		spd->partial[spd->nr_pages - 1].len += *len;
+		page_frag_refill_commit_noref(nc, pfrag, *len);
+		return false;
+	}
+
+	page_frag_refill_commit(nc, pfrag, *len);
+	spd->pages[spd->nr_pages] = pfrag->page;
+	spd->partial[spd->nr_pages].len = *len;
+	spd->partial[spd->nr_pages].offset = pfrag->offset;
+	spd->nr_pages++;
+
+	return false;
+}
+
 /*
  * Fill page/offset/length into spd, if it can hold more pages.
  */
@@ -3103,11 +3115,9 @@ static bool spd_fill_page(struct splice_pipe_desc *spd,
 	if (unlikely(spd->nr_pages == MAX_SKB_FRAGS))
 		return true;
 
-	if (linear) {
-		page = linear_to_page(page, len, &offset, sk);
-		if (!page)
-			return true;
-	}
+	if (linear)
+		return spd_fill_linear_page(spd, page, offset, len, sk);
+
 	if (spd_can_coalesce(spd, page, offset)) {
 		spd->partial[spd->nr_pages - 1].len += *len;
 		return false;
diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index e90fbab703b2..db53f619e69a 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -27,23 +27,25 @@ static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
 int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
 		 int elem_first_coalesce)
 {
-	struct page_frag *pfrag = sk_page_frag(sk);
+	struct page_frag_cache *nc = sk_page_frag_cache(sk);
 	u32 osize = msg->sg.size;
 	int ret = 0;
 
 	len -= msg->sg.size;
 	while (len > 0) {
+		struct page_frag page_frag, *pfrag;
 		struct scatterlist *sge;
 		u32 orig_offset;
 		int use, i;
 
-		if (!sk_page_frag_refill(sk, pfrag)) {
+		pfrag = &page_frag;
+		if (!sk_page_frag_refill_prepare(sk, nc, pfrag)) {
 			ret = -ENOMEM;
 			goto msg_trim;
 		}
 
 		orig_offset = pfrag->offset;
-		use = min_t(int, len, pfrag->size - orig_offset);
+		use = min_t(int, len, pfrag->size);
 		if (!sk_wmem_schedule(sk, use)) {
 			ret = -ENOMEM;
 			goto msg_trim;
@@ -57,6 +59,7 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
 		    sg_page(sge) == pfrag->page &&
 		    sge->offset + sge->length == orig_offset) {
 			sge->length += use;
+			page_frag_refill_commit_noref(nc, pfrag, use);
 		} else {
 			if (sk_msg_full(msg)) {
 				ret = -ENOSPC;
@@ -66,13 +69,12 @@ int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
 			sge = &msg->sg.data[msg->sg.end];
 			sg_unmark_end(sge);
 			sg_set_page(sge, pfrag->page, use, orig_offset);
-			get_page(pfrag->page);
+			page_frag_refill_commit(nc, pfrag, use);
 			sk_msg_iter_next(msg, end);
 		}
 
 		sk_mem_charge(sk, use);
 		msg->sg.size += use;
-		pfrag->offset += use;
 		len -= use;
 	}
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 74729d20cd00..c186ef593426 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2276,10 +2276,7 @@ static void __sk_destruct(struct rcu_head *head)
 		pr_debug("%s: optmem leakage (%d bytes) detected\n",
 			 __func__, atomic_read(&sk->sk_omem_alloc));
 
-	if (sk->sk_frag.page) {
-		put_page(sk->sk_frag.page);
-		sk->sk_frag.page = NULL;
-	}
+	page_frag_cache_drain(&sk->sk_frag);
 
 	/* We do not need to acquire sk->sk_peer_lock, we are the last user. */
 	put_cred(sk->sk_peer_cred);
@@ -3035,16 +3032,33 @@ bool skb_page_frag_refill(unsigned int sz, struct page_frag *pfrag, gfp_t gfp)
 }
 EXPORT_SYMBOL(skb_page_frag_refill);
 
-bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+bool sk_page_frag_refill_prepare(struct sock *sk, struct page_frag_cache *nc,
+				 struct page_frag *pfrag)
 {
-	if (likely(skb_page_frag_refill(32U, pfrag, sk->sk_allocation)))
+	if (likely(page_frag_refill_prepare(nc, 32U, pfrag, sk->sk_allocation)))
 		return true;
 
 	sk_enter_memory_pressure(sk);
 	sk_stream_moderate_sndbuf(sk);
 	return false;
 }
-EXPORT_SYMBOL(sk_page_frag_refill);
+EXPORT_SYMBOL(sk_page_frag_refill_prepare);
+
+void *sk_page_frag_alloc_refill_prepare(struct sock *sk,
+					struct page_frag_cache *nc,
+					struct page_frag *pfrag)
+{
+	void *va;
+
+	va = page_frag_alloc_refill_prepare(nc, 32U, pfrag, sk->sk_allocation);
+	if (likely(va))
+		return va;
+
+	sk_enter_memory_pressure(sk);
+	sk_stream_moderate_sndbuf(sk);
+	return NULL;
+}
+EXPORT_SYMBOL(sk_page_frag_alloc_refill_prepare);
 
 void __lock_sock(struct sock *sk)
 	__releases(&sk->sk_lock.slock)
@@ -3566,8 +3580,8 @@ void sock_init_data_uid(struct socket *sock, struct sock *sk, kuid_t uid)
 	sk->sk_error_report	=	sock_def_error_report;
 	sk->sk_destruct		=	sock_def_destruct;
 
-	sk->sk_frag.page	=	NULL;
-	sk->sk_frag.offset	=	0;
+	page_frag_cache_init(&sk->sk_frag);
+
 	sk->sk_peek_off		=	-1;
 
 	sk->sk_peer_pid 	=	NULL;
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index a59204a8d850..c94a428a5e37 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -953,7 +953,7 @@ static int __ip_append_data(struct sock *sk,
 			    struct flowi4 *fl4,
 			    struct sk_buff_head *queue,
 			    struct inet_cork *cork,
-			    struct page_frag *pfrag,
+			    struct page_frag_cache *nc,
 			    int getfrag(void *from, char *to, int offset,
 					int len, int odd, struct sk_buff *skb),
 			    void *from, int length, int transhdrlen,
@@ -1237,13 +1237,19 @@ static int __ip_append_data(struct sock *sk,
 			copy = err;
 			wmem_alloc_delta += copy;
 		} else if (!zc) {
+			struct page_frag page_frag, *pfrag;
 			int i = skb_shinfo(skb)->nr_frags;
+			void *va;
 
 			err = -ENOMEM;
-			if (!sk_page_frag_refill(sk, pfrag))
+			pfrag = &page_frag;
+			va = sk_page_frag_alloc_refill_prepare(sk, nc, pfrag);
+			if (!va)
 				goto error;
 
 			skb_zcopy_downgrade_managed(skb);
+			copy = min_t(int, copy, pfrag->size);
+
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
 				err = -EMSGSIZE;
@@ -1251,19 +1257,19 @@ static int __ip_append_data(struct sock *sk,
 					goto error;
 
 				__skb_fill_page_desc(skb, i, pfrag->page,
-						     pfrag->offset, 0);
+						     pfrag->offset, copy);
 				skb_shinfo(skb)->nr_frags = ++i;
-				get_page(pfrag->page);
+				page_frag_refill_commit(nc, pfrag, copy);
+			} else {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1],
+						  copy);
+				page_frag_refill_commit_noref(nc, pfrag, copy);
 			}
-			copy = min_t(int, copy, pfrag->size - pfrag->offset);
+
 			if (INDIRECT_CALL_1(getfrag, ip_generic_getfrag,
-				    from,
-				    page_address(pfrag->page) + pfrag->offset,
-				    offset, copy, skb->len, skb) < 0)
+				    from, va, offset, copy, skb->len, skb) < 0)
 				goto error_efault;
 
-			pfrag->offset += copy;
-			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 			skb_len_add(skb, copy);
 			wmem_alloc_delta += copy;
 		} else {
@@ -1378,7 +1384,7 @@ int ip_append_data(struct sock *sk, struct flowi4 *fl4,
 	}
 
 	return __ip_append_data(sk, fl4, &sk->sk_write_queue, &inet->cork.base,
-				sk_page_frag(sk), getfrag,
+				sk_page_frag_cache(sk), getfrag,
 				from, length, transhdrlen, flags);
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0fbf1e222cda..24068f949c4f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1193,9 +1193,13 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 		if (zc == 0) {
 			bool merge = true;
 			int i = skb_shinfo(skb)->nr_frags;
-			struct page_frag *pfrag = sk_page_frag(sk);
+			struct page_frag_cache *nc = sk_page_frag_cache(sk);
+			struct page_frag page_frag, *pfrag;
+			void *va;
 
-			if (!sk_page_frag_refill(sk, pfrag))
+			pfrag = &page_frag;
+			va = sk_page_frag_alloc_refill_prepare(sk, nc, pfrag);
+			if (!va)
 				goto wait_for_space;
 
 			if (!skb_can_coalesce(skb, i, pfrag->page,
@@ -1207,7 +1211,7 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				merge = false;
 			}
 
-			copy = min_t(int, copy, pfrag->size - pfrag->offset);
+			copy = min_t(int, copy, pfrag->size);
 
 			if (unlikely(skb_zcopy_pure(skb) || skb_zcopy_managed(skb))) {
 				if (tcp_downgrade_zcopy_pure(sk, skb))
@@ -1220,20 +1224,19 @@ int tcp_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t size)
 				goto wait_for_space;
 
 			err = skb_copy_to_frag_nocache(sk, &msg->msg_iter, skb,
-						       page_address(pfrag->page) +
-						       pfrag->offset, copy);
+						       va, copy);
 			if (err)
 				goto do_error;
 
 			/* Update the skb. */
 			if (merge) {
 				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
+				page_frag_refill_commit_noref(nc, pfrag, copy);
 			} else {
 				skb_fill_page_desc(skb, i, pfrag->page,
 						   pfrag->offset, copy);
-				page_ref_inc(pfrag->page);
+				page_frag_refill_commit(nc, pfrag, copy);
 			}
-			pfrag->offset += copy;
 		} else if (zc == MSG_ZEROCOPY)  {
 			/* First append to a fragless skb builds initial
 			 * pure zerocopy skb
@@ -3393,11 +3396,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 
 	WARN_ON(inet->inet_num && !icsk->icsk_bind_hash);
 
-	if (sk->sk_frag.page) {
-		put_page(sk->sk_frag.page);
-		sk->sk_frag.page = NULL;
-		sk->sk_frag.offset = 0;
-	}
+	page_frag_cache_drain(&sk->sk_frag);
 	sk_error_report(sk);
 	return 0;
 }
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 5485a70b5fe5..d84b0d477a65 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -3968,9 +3968,11 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct tcp_fastopen_request *fo = tp->fastopen_req;
-	struct page_frag *pfrag = sk_page_frag(sk);
+	struct page_frag_cache *nc = sk_page_frag_cache(sk);
+	struct page_frag page_frag, *pfrag;
 	struct sk_buff *syn_data;
 	int space, err = 0;
+	void *va;
 
 	tp->rx_opt.mss_clamp = tp->advmss;  /* If MSS is not cached */
 	if (!tcp_fastopen_cookie_check(sk, &tp->rx_opt.mss_clamp, &fo->cookie))
@@ -3989,21 +3991,25 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 
 	space = min_t(size_t, space, fo->size);
 
-	if (space &&
-	    !skb_page_frag_refill(min_t(size_t, space, PAGE_SIZE),
-				  pfrag, sk->sk_allocation))
-		goto fallback;
+	if (space) {
+		pfrag = &page_frag;
+		va = page_frag_alloc_refill_prepare(nc,
+						    min_t(size_t, space, PAGE_SIZE),
+						    pfrag, sk->sk_allocation);
+		if (!va)
+			goto fallback;
+	}
+
 	syn_data = tcp_stream_alloc_skb(sk, sk->sk_allocation, false);
 	if (!syn_data)
 		goto fallback;
 	memcpy(syn_data->cb, syn->cb, sizeof(syn->cb));
 	if (space) {
-		space = min_t(size_t, space, pfrag->size - pfrag->offset);
+		space = min_t(size_t, space, pfrag->size);
 		space = tcp_wmem_schedule(sk, space);
 	}
 	if (space) {
-		space = copy_page_from_iter(pfrag->page, pfrag->offset,
-					    space, &fo->data->msg_iter);
+		space = _copy_from_iter(va, space, &fo->data->msg_iter);
 		if (unlikely(!space)) {
 			tcp_skb_tsorted_anchor_cleanup(syn_data);
 			kfree_skb(syn_data);
@@ -4011,8 +4017,7 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
 		}
 		skb_fill_page_desc(syn_data, 0, pfrag->page,
 				   pfrag->offset, space);
-		page_ref_inc(pfrag->page);
-		pfrag->offset += space;
+		page_frag_refill_commit(nc, pfrag, space);
 		skb_len_add(syn_data, space);
 		skb_zcopy_set(syn_data, fo->uarg, NULL);
 	}
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3d672dea9f56..6e11dd8089e4 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -1416,7 +1416,7 @@ static int __ip6_append_data(struct sock *sk,
 			     struct sk_buff_head *queue,
 			     struct inet_cork_full *cork_full,
 			     struct inet6_cork *v6_cork,
-			     struct page_frag *pfrag,
+			     struct page_frag_cache *nc,
 			     int getfrag(void *from, char *to, int offset,
 					 int len, int odd, struct sk_buff *skb),
 			     void *from, size_t length, int transhdrlen,
@@ -1764,13 +1764,19 @@ static int __ip6_append_data(struct sock *sk,
 			copy = err;
 			wmem_alloc_delta += copy;
 		} else if (!zc) {
+			struct page_frag page_frag, *pfrag;
 			int i = skb_shinfo(skb)->nr_frags;
+			void *va;
 
 			err = -ENOMEM;
-			if (!sk_page_frag_refill(sk, pfrag))
+			pfrag = &page_frag;
+			va = sk_page_frag_alloc_refill_prepare(sk, nc, pfrag);
+			if (!va)
 				goto error;
 
 			skb_zcopy_downgrade_managed(skb);
+			copy = min_t(int, copy, pfrag->size);
+
 			if (!skb_can_coalesce(skb, i, pfrag->page,
 					      pfrag->offset)) {
 				err = -EMSGSIZE;
@@ -1778,19 +1784,19 @@ static int __ip6_append_data(struct sock *sk,
 					goto error;
 
 				__skb_fill_page_desc(skb, i, pfrag->page,
-						     pfrag->offset, 0);
+						     pfrag->offset, copy);
 				skb_shinfo(skb)->nr_frags = ++i;
-				get_page(pfrag->page);
+				page_frag_refill_commit(nc, pfrag, copy);
+			} else {
+				skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1],
+						  copy);
+				page_frag_refill_commit_noref(nc, pfrag, copy);
 			}
-			copy = min_t(int, copy, pfrag->size - pfrag->offset);
+
 			if (INDIRECT_CALL_1(getfrag, ip_generic_getfrag,
-				    from,
-				    page_address(pfrag->page) + pfrag->offset,
-				    offset, copy, skb->len, skb) < 0)
+				    from, va, offset, copy, skb->len, skb) < 0)
 				goto error_efault;
 
-			pfrag->offset += copy;
-			skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
 			skb->len += copy;
 			skb->data_len += copy;
 			skb->truesize += copy;
@@ -1853,7 +1859,7 @@ int ip6_append_data(struct sock *sk,
 	}
 
 	return __ip6_append_data(sk, &sk->sk_write_queue, &inet->cork,
-				 &np->cork, sk_page_frag(sk), getfrag,
+				 &np->cork, sk_page_frag_cache(sk), getfrag,
 				 from, length, transhdrlen, flags, ipc6);
 }
 EXPORT_SYMBOL_GPL(ip6_append_data);
diff --git a/net/kcm/kcmsock.c b/net/kcm/kcmsock.c
index 94719d4af5fa..8f241a7173ed 100644
--- a/net/kcm/kcmsock.c
+++ b/net/kcm/kcmsock.c
@@ -804,9 +804,13 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 	while (msg_data_left(msg)) {
 		bool merge = true;
 		int i = skb_shinfo(skb)->nr_frags;
-		struct page_frag *pfrag = sk_page_frag(sk);
+		struct page_frag_cache *nc = sk_page_frag_cache(sk);
+		struct page_frag page_frag, *pfrag;
+		void *va;
 
-		if (!sk_page_frag_refill(sk, pfrag))
+		pfrag = &page_frag;
+		va = sk_page_frag_alloc_refill_prepare(sk, nc, pfrag);
+		if (!va)
 			goto wait_for_memory;
 
 		if (!skb_can_coalesce(skb, i, pfrag->page,
@@ -851,14 +855,12 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 			if (head != skb)
 				head->truesize += copy;
 		} else {
-			copy = min_t(int, msg_data_left(msg),
-				     pfrag->size - pfrag->offset);
+			copy = min_t(int, msg_data_left(msg), pfrag->size);
 			if (!sk_wmem_schedule(sk, copy))
 				goto wait_for_memory;
 
 			err = skb_copy_to_frag_nocache(sk, &msg->msg_iter, skb,
-						       page_address(pfrag->page) +
-						       pfrag->offset, copy);
+						       va, copy);
 			if (err)
 				goto out_error;
 
@@ -866,13 +868,13 @@ static int kcm_sendmsg(struct socket *sock, struct msghdr *msg, size_t len)
 			if (merge) {
 				skb_frag_size_add(
 					&skb_shinfo(skb)->frags[i - 1], copy);
+				page_frag_refill_commit_noref(nc, pfrag, copy);
 			} else {
 				skb_fill_page_desc(skb, i, pfrag->page,
 						   pfrag->offset, copy);
-				get_page(pfrag->page);
+				page_frag_refill_commit(nc, pfrag, copy);
 			}
 
-			pfrag->offset += copy;
 		}
 
 		copied += copy;
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 08a72242428c..815d4e48a44e 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -978,7 +978,6 @@ static bool mptcp_skb_can_collapse_to(u64 write_seq,
 }
 
 /* we can append data to the given data frag if:
- * - there is space available in the backing page_frag
  * - the data frag tail matches the current page_frag free offset
  * - the data frag end sequence number matches the current write seq
  */
@@ -987,7 +986,6 @@ static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk,
 				       const struct mptcp_data_frag *df)
 {
 	return df && pfrag->page == df->page &&
-		pfrag->size - pfrag->offset > 0 &&
 		pfrag->offset == (df->offset + df->data_len) &&
 		df->data_seq + df->data_len == msk->write_seq;
 }
@@ -1103,14 +1101,20 @@ static void mptcp_enter_memory_pressure(struct sock *sk)
 /* ensure we get enough memory for the frag hdr, beyond some minimal amount of
  * data
  */
-static bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
+static void *mptcp_page_frag_alloc_refill_prepare(struct sock *sk,
+						  struct page_frag_cache *nc,
+						  struct page_frag *pfrag)
 {
-	if (likely(skb_page_frag_refill(32U + sizeof(struct mptcp_data_frag),
-					pfrag, sk->sk_allocation)))
-		return true;
+	unsigned int fragsz = 32U + sizeof(struct mptcp_data_frag);
+	void *va;
+
+	va = page_frag_alloc_refill_prepare(nc, fragsz, pfrag,
+					    sk->sk_allocation);
+	if (likely(va))
+		return va;
 
 	mptcp_enter_memory_pressure(sk);
-	return false;
+	return NULL;
 }
 
 static struct mptcp_data_frag *
@@ -1813,7 +1817,7 @@ static u32 mptcp_send_limit(const struct sock *sk)
 static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct page_frag *pfrag;
+	struct page_frag_cache *nc;
 	size_t copied = 0;
 	int ret = 0;
 	long timeo;
@@ -1847,14 +1851,16 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	if (unlikely(sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN)))
 		goto do_error;
 
-	pfrag = sk_page_frag(sk);
+	nc = sk_page_frag_cache(sk);
 
 	while (msg_data_left(msg)) {
+		struct page_frag page_frag, *pfrag;
 		int total_ts, frag_truesize = 0;
 		struct mptcp_data_frag *dfrag;
 		bool dfrag_collapsed;
-		size_t psize, offset;
 		u32 copy_limit;
+		size_t psize;
+		void *va;
 
 		/* ensure fitting the notsent_lowat() constraint */
 		copy_limit = mptcp_send_limit(sk);
@@ -1865,21 +1871,26 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 * page allocator
 		 */
 		dfrag = mptcp_pending_tail(sk);
-		dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag);
+		pfrag = &page_frag;
+		va = page_frag_alloc_refill_probe(nc, 1, pfrag);
+		dfrag_collapsed = va && mptcp_frag_can_collapse_to(msk, pfrag,
+								   dfrag);
 		if (!dfrag_collapsed) {
-			if (!mptcp_page_frag_refill(sk, pfrag))
+			va = mptcp_page_frag_alloc_refill_prepare(sk, nc,
+								  pfrag);
+			if (!va)
 				goto wait_for_memory;
 
 			dfrag = mptcp_carve_data_frag(msk, pfrag, pfrag->offset);
 			frag_truesize = dfrag->overhead;
+			va += dfrag->overhead;
 		}
 
 		/* we do not bound vs wspace, to allow a single packet.
 		 * memory accounting will prevent execessive memory usage
 		 * anyway
 		 */
-		offset = dfrag->offset + dfrag->data_len;
-		psize = pfrag->size - offset;
+		psize = pfrag->size - frag_truesize;
 		psize = min_t(size_t, psize, msg_data_left(msg));
 		psize = min_t(size_t, psize, copy_limit);
 		total_ts = psize + frag_truesize;
@@ -1887,8 +1898,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		if (!sk_wmem_schedule(sk, total_ts))
 			goto wait_for_memory;
 
-		ret = do_copy_data_nocache(sk, psize, &msg->msg_iter,
-					   page_address(dfrag->page) + offset);
+		ret = do_copy_data_nocache(sk, psize, &msg->msg_iter, va);
 		if (ret)
 			goto do_error;
 
@@ -1897,7 +1907,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		copied += psize;
 		dfrag->data_len += psize;
 		frag_truesize += psize;
-		pfrag->offset += frag_truesize;
 		WRITE_ONCE(msk->write_seq, msk->write_seq + psize);
 
 		/* charge data on mptcp pending queue to the msk socket
@@ -1905,10 +1914,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 		 */
 		sk_wmem_queued_add(sk, frag_truesize);
 		if (!dfrag_collapsed) {
-			get_page(dfrag->page);
+			page_frag_refill_commit(nc, pfrag, frag_truesize);
 			list_add_tail(&dfrag->list, &msk->rtx_queue);
 			if (!msk->first_pending)
 				WRITE_ONCE(msk->first_pending, dfrag);
+		} else {
+			page_frag_refill_commit_noref(nc, pfrag, frag_truesize);
 		}
 		pr_debug("msk=%p dfrag at seq=%llu len=%u sent=%u new=%d\n", msk,
 			 dfrag->data_seq, dfrag->data_len, dfrag->already_sent,
diff --git a/net/tls/tls_device.c b/net/tls/tls_device.c
index dc063c2c7950..0f020293fe10 100644
--- a/net/tls/tls_device.c
+++ b/net/tls/tls_device.c
@@ -253,8 +253,8 @@ static void tls_device_resync_tx(struct sock *sk, struct tls_context *tls_ctx,
 }
 
 static void tls_append_frag(struct tls_record_info *record,
-			    struct page_frag *pfrag,
-			    int size)
+			    struct page_frag_cache *nc,
+			    struct page_frag *pfrag, int size)
 {
 	skb_frag_t *frag;
 
@@ -262,15 +262,34 @@ static void tls_append_frag(struct tls_record_info *record,
 	if (skb_frag_page(frag) == pfrag->page &&
 	    skb_frag_off(frag) + skb_frag_size(frag) == pfrag->offset) {
 		skb_frag_size_add(frag, size);
+		page_frag_refill_commit_noref(nc, pfrag, size);
 	} else {
 		++frag;
 		skb_frag_fill_page_desc(frag, pfrag->page, pfrag->offset,
 					size);
 		++record->num_frags;
+		page_frag_refill_commit(nc, pfrag, size);
+	}
+
+	record->len += size;
+}
+
+static void tls_append_dummy_frag(struct tls_record_info *record,
+				  struct page_frag *pfrag, int size)
+{
+	skb_frag_t *frag;
+
+	frag = &record->frags[record->num_frags - 1];
+	if (skb_frag_page(frag) == pfrag->page &&
+	    skb_frag_off(frag) + skb_frag_size(frag) == pfrag->offset) {
+		skb_frag_size_add(frag, size);
+	} else {
+		++frag;
+		skb_frag_fill_page_desc(frag, pfrag->page, pfrag->offset, size);
+		++record->num_frags;
 		get_page(pfrag->page);
 	}
 
-	pfrag->offset += size;
 	record->len += size;
 }
 
@@ -311,11 +330,11 @@ static int tls_push_record(struct sock *sk,
 static void tls_device_record_close(struct sock *sk,
 				    struct tls_context *ctx,
 				    struct tls_record_info *record,
-				    struct page_frag *pfrag,
+				    struct page_frag_cache *nc,
 				    unsigned char record_type)
 {
 	struct tls_prot_info *prot = &ctx->prot_info;
-	struct page_frag dummy_tag_frag;
+	struct page_frag dummy_tag_frag, *pfrag;
 
 	/* append tag
 	 * device will fill in the tag, we just need to append a placeholder
@@ -323,13 +342,16 @@ static void tls_device_record_close(struct sock *sk,
 	 * increases frag count)
 	 * if we can't allocate memory now use the dummy page
 	 */
-	if (unlikely(pfrag->size - pfrag->offset < prot->tag_size) &&
-	    !skb_page_frag_refill(prot->tag_size, pfrag, sk->sk_allocation)) {
+	pfrag = &dummy_tag_frag;
+	if (unlikely(!page_frag_refill_probe(nc, prot->tag_size, pfrag) &&
+		     !page_frag_refill_prepare(nc, prot->tag_size, pfrag,
+					       sk->sk_allocation))) {
 		dummy_tag_frag.page = dummy_page;
 		dummy_tag_frag.offset = 0;
-		pfrag = &dummy_tag_frag;
+		tls_append_dummy_frag(record, pfrag, prot->tag_size);
+	} else {
+		tls_append_frag(record, nc, pfrag, prot->tag_size);
 	}
-	tls_append_frag(record, pfrag, prot->tag_size);
 
 	/* fill prepend */
 	tls_fill_prepend(ctx, skb_frag_address(&record->frags[0]),
@@ -338,6 +360,7 @@ static void tls_device_record_close(struct sock *sk,
 }
 
 static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx,
+				 struct page_frag_cache *nc,
 				 struct page_frag *pfrag,
 				 size_t prepend_size)
 {
@@ -352,8 +375,7 @@ static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx,
 	skb_frag_fill_page_desc(frag, pfrag->page, pfrag->offset,
 				prepend_size);
 
-	get_page(pfrag->page);
-	pfrag->offset += prepend_size;
+	page_frag_refill_commit(nc, pfrag, prepend_size);
 
 	record->num_frags = 1;
 	record->len = prepend_size;
@@ -361,33 +383,34 @@ static int tls_create_new_record(struct tls_offload_context_tx *offload_ctx,
 	return 0;
 }
 
-static int tls_do_allocation(struct sock *sk,
-			     struct tls_offload_context_tx *offload_ctx,
-			     struct page_frag *pfrag,
-			     size_t prepend_size)
+static void *tls_do_allocation(struct sock *sk,
+			       struct tls_offload_context_tx *offload_ctx,
+			       struct page_frag_cache *nc,
+			       size_t prepend_size, struct page_frag *pfrag)
 {
 	int ret;
 
 	if (!offload_ctx->open_record) {
-		if (unlikely(!skb_page_frag_refill(prepend_size, pfrag,
-						   sk->sk_allocation))) {
+		void *va;
+
+		if (unlikely(!page_frag_refill_prepare(nc, prepend_size, pfrag,
+						       sk->sk_allocation))) {
 			READ_ONCE(sk->sk_prot)->enter_memory_pressure(sk);
 			sk_stream_moderate_sndbuf(sk);
-			return -ENOMEM;
+			return NULL;
 		}
 
-		ret = tls_create_new_record(offload_ctx, pfrag, prepend_size);
+		ret = tls_create_new_record(offload_ctx, nc, pfrag,
+					    prepend_size);
 		if (ret)
-			return ret;
+			return NULL;
 
-		if (pfrag->size > pfrag->offset)
-			return 0;
+		va = page_frag_alloc_refill_probe(nc, 1, pfrag);
+		if (va)
+			return va;
 	}
 
-	if (!sk_page_frag_refill(sk, pfrag))
-		return -ENOMEM;
-
-	return 0;
+	return sk_page_frag_alloc_refill_prepare(sk, nc, pfrag);
 }
 
 static int tls_device_copy_data(void *addr, size_t bytes, struct iov_iter *i)
@@ -424,8 +447,8 @@ static int tls_push_data(struct sock *sk,
 	struct tls_prot_info *prot = &tls_ctx->prot_info;
 	struct tls_offload_context_tx *ctx = tls_offload_ctx_tx(tls_ctx);
 	struct tls_record_info *record;
+	struct page_frag_cache *nc;
 	int tls_push_record_flags;
-	struct page_frag *pfrag;
 	size_t orig_size = size;
 	u32 max_open_record_len;
 	bool more = false;
@@ -454,7 +477,7 @@ static int tls_push_data(struct sock *sk,
 			return rc;
 	}
 
-	pfrag = sk_page_frag(sk);
+	nc = sk_page_frag_cache(sk);
 
 	/* TLS_HEADER_SIZE is not counted as part of the TLS record, and
 	 * we need to leave room for an authentication tag.
@@ -462,8 +485,12 @@ static int tls_push_data(struct sock *sk,
 	max_open_record_len = TLS_MAX_PAYLOAD_SIZE +
 			      prot->prepend_size;
 	do {
-		rc = tls_do_allocation(sk, ctx, pfrag, prot->prepend_size);
-		if (unlikely(rc)) {
+		struct page_frag page_frag, *pfrag;
+		void *va;
+
+		pfrag = &page_frag;
+		va = tls_do_allocation(sk, ctx, nc, prot->prepend_size, pfrag);
+		if (unlikely(!va)) {
 			rc = sk_stream_wait_memory(sk, &timeo);
 			if (!rc)
 				continue;
@@ -512,16 +539,15 @@ static int tls_push_data(struct sock *sk,
 
 			zc_pfrag.offset = off;
 			zc_pfrag.size = copy;
-			tls_append_frag(record, &zc_pfrag, copy);
+			tls_append_dummy_frag(record, &zc_pfrag, copy);
 		} else if (copy) {
-			copy = min_t(size_t, copy, pfrag->size - pfrag->offset);
+			copy = min_t(size_t, copy, pfrag->size);
 
-			rc = tls_device_copy_data(page_address(pfrag->page) +
-						  pfrag->offset, copy,
-						  iter);
+			rc = tls_device_copy_data(va, copy, iter);
 			if (rc)
 				goto handle_error;
-			tls_append_frag(record, pfrag, copy);
+
+			tls_append_frag(record, nc, pfrag, copy);
 		}
 
 		size -= copy;
@@ -539,7 +565,7 @@ static int tls_push_data(struct sock *sk,
 		if (done || record->len >= max_open_record_len ||
 		    (record->num_frags >= MAX_SKB_FRAGS - 1)) {
 			tls_device_record_close(sk, tls_ctx, record,
-						pfrag, record_type);
+						nc, record_type);
 
 			rc = tls_push_record(sk,
 					     tls_ctx,
-- 
2.33.0



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH net-next v2 10/10] mm: page_frag: add an entry in MAINTAINERS for page_frag
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (8 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 09/10] net: replace page_frag with page_frag_cache Yunsheng Lin
@ 2024-12-06 12:25 ` Yunsheng Lin
  2024-12-08 21:34 ` [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Alexander Duyck
  10 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-06 12:25 UTC (permalink / raw)
  To: davem, kuba, pabeni
  Cc: netdev, linux-kernel, Yunsheng Lin, Alexander Duyck,
	Andrew Morton, Linux-MM

After this patchset, page_frag is a small subsystem/library
of its own, so add an entry in MAINTAINERS listing its
maintainers, mailing lists, status and files.

Alexander is the original author of page_frag, so add him to
MAINTAINERS too.

CC: Alexander Duyck <alexander.duyck@gmail.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Linux-MM <linux-mm@kvack.org>
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
---
 MAINTAINERS | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 0456a33ef657..7d3725bc40aa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17585,6 +17585,18 @@ F:	mm/page-writeback.c
 F:	mm/readahead.c
 F:	mm/truncate.c
 
+PAGE FRAG
+M:	Alexander Duyck <alexander.duyck@gmail.com>
+M:	Yunsheng Lin <linyunsheng@huawei.com>
+L:	linux-mm@kvack.org
+L:	netdev@vger.kernel.org
+S:	Supported
+F:	Documentation/mm/page_frags.rst
+F:	include/linux/page_frag_cache.h
+F:	mm/page_frag_cache.c
+F:	tools/testing/selftests/mm/page_frag/
+F:	tools/testing/selftests/mm/test_page_frag.sh
+
 PAGE POOL
 M:	Jesper Dangaard Brouer <hawk@kernel.org>
 M:	Ilias Apalodimas <ilias.apalodimas@linaro.org>
-- 
2.33.0




* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
                   ` (9 preceding siblings ...)
  2024-12-06 12:25 ` [PATCH net-next v2 10/10] mm: page_frag: add an entry in MAINTAINERS for page_frag Yunsheng Lin
@ 2024-12-08 21:34 ` Alexander Duyck
  2024-12-09 11:42   ` Yunsheng Lin
  10 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2024-12-08 21:34 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On Fri, Dec 6, 2024 at 4:32 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> This is part 2 of "Replace page_frag with page_frag_cache",
> which introduces the new API and replaces page_frag with
> page_frag_cache for sk_page_frag().
>
> The part 1 of "Replace page_frag with page_frag_cache" is in
> [1].
>
> After [2], there are still two implementations for page frag:
>
> 1. mm/page_alloc.c: net stack seems to be using it in the
>    rx part with 'struct page_frag_cache' and the main API
>    being page_frag_alloc_align().
> 2. net/core/sock.c: net stack seems to be using it in the
>    tx part with 'struct page_frag' and the main API being
>    skb_page_frag_refill().
>
> This patchset tries to unfiy the page frag implementation
> by replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seems matter that much now as pcp
> is also supported for high-order pages:
> commit 44042b449872 ("mm/page_alloc: allow high-order pages to
> be stored on the per-cpu lists")
>
> As the related change is mostly related to networking, so
> targeting the net-next. And will try to replace the rest
> of page_frag in the follow patchset.
>
> After this patchset:
> 1. Unify the page frag implementation by taking the best out of
>    two the existing implementations: we are able to save some space
>    for the 'page_frag_cache' API user, and avoid 'get_page()' for
>    the old 'page_frag' API user.
> 2. Future bugfix and performance can be done in one place, hence
>    improving maintainability of page_frag's implementation.
>
> Performance validation for part2:
> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>    non-aligned API performance impact for the existing users, there
>    seems to be about 20% performance degradation for refactoring
>    page_frag to support the new API, which seems to nullify most of
>    the performance gain in [3] of part1.

So if I am understanding correctly then this is showing a 20%
performance degradation with this patchset. I would argue that it is
significant enough that it would be a blocking factor for this patch
set. I would suggest bisecting the patch set to identify where the
performance degradation has been added and see what we can do to
resolve it, and if nothing else document it in that patch so we can
identify the root cause for the slowdown.

> 2. Use the below netcat test case, there seems to be some minor
>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>    using the new page_frag API after this patchset.
>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234

This test would barely touch the page pool. The fact is most of the
overhead for this would likely be things like TCP latency and data
copy much more than the page allocation. As such fluctuations here are
likely not related to your changes.



* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-08 21:34 ` [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Alexander Duyck
@ 2024-12-09 11:42   ` Yunsheng Lin
  2024-12-09 16:03     ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-09 11:42 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On 2024/12/9 5:34, Alexander Duyck wrote:

...

>>
>> Performance validation for part2:
>> 1. Using micro-benchmark ko added in patch 1 to test aligned and
>>    non-aligned API performance impact for the existing users, there
>>    seems to be about 20% performance degradation for refactoring
>>    page_frag to support the new API, which seems to nullify most of
>>    the performance gain in [3] of part1.
> 
> So if I am understanding correctly then this is showing a 20%
> performance degradation with this patchset. I would argue that it is
> significant enough that it would be a blocking factor for this patch
> set. I would suggest bisecting the patch set to identify where the
> performance degradation has been added and see what we can do to
> resolve it, and if nothing else document it in that patch so we can
> identify the root cause for the slowdown.

The only patch in this patchset that affects the performance of the existing
API seems to be patch 1. Applying only patch 1 shows the same ~20% performance
degradation as applying the whole patchset:
mm: page_frag: some minor refactoring before adding new API

And the cause seems to be the increase in binary size shown below, as the
performance degradation didn't seem to change much when I tried inlining
__page_frag_cache_commit_noref() by moving it to the header file:

./scripts/bloat-o-meter vmlinux_orig vmlinux
add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
Function                                     old     new   delta
__page_frag_cache_prepare                      -     500    +500
__napi_alloc_frag_align                       68     180    +112
__netdev_alloc_skb                           488     596    +108
napi_alloc_skb                               556     624     +68
__netdev_alloc_frag_align                    196     252     +56
svc_tcp_sendmsg                              340     376     +36
__page_frag_cache_commit_noref                 -      32     +32
e843419@09a6_0000bd47_30                       -       8      +8
e843419@0369_000044ee_684                      8       -      -8
__page_frag_alloc_align                      492       -    -492
Total: Before=34719207, After=34719627, chg +0.00%

./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
Function                                     old     new   delta
page_frag_push_thread                        508     580     +72
__UNIQUE_ID_vermagic367                       67      73      +6
Total: Before=4582, After=4660, chg +1.70%
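
As a cross-check of the bloat-o-meter summary lines above, the per-symbol
deltas can be summed by hand. The snippet below is only an illustrative
sketch: the helper names (sum_deltas, pct_change) and the array are made up
for this example and are not part of scripts/bloat-o-meter.

```c
#include <stddef.h>

/* Per-symbol size deltas in bytes, copied from the vmlinux table above. */
static const int vmlinux_delta[] = {
	+500, +112, +108, +68, +56, +36, +32, +8, -8, -492,
};

/* Hypothetical helper type: total growth and total shrinkage. */
struct bloat_sum {
	int up;		/* sum of positive deltas */
	int down;	/* sum of negative deltas (zero or negative) */
};

/* Split the deltas into an "up" total and a "down" total. */
static struct bloat_sum sum_deltas(const int *delta, size_t n)
{
	struct bloat_sum s = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		if (delta[i] > 0)
			s.up += delta[i];
		else
			s.down += delta[i];
	}
	return s;
}

/* Relative change in percent, as in the "chg" column. */
static double pct_change(long before, long after)
{
	return 100.0 * (double)(after - before) / (double)before;
}
```

Summing vmlinux_delta gives up/down of 920/-500, a net growth of 420 bytes,
which is about 0.001% of the 34719207-byte image and therefore shows up as
"+0.00%". For the test module, pct_change(4582, 4660) is about 1.70%.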

Patch 1 refactors the common code from __page_frag_alloc_va_align()
into __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
new API can reuse as much of it as possible.
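
To make the prepare/commit split concrete, here is a minimal userspace
sketch of the calling pattern that the converted callers in patch 9 follow.
It is not the kernel implementation: struct frag_cache, struct frag_view and
the frag_*() helpers are hypothetical stand-ins, refill is left out, and the
page reference count is modelled as a plain counter.

```c
#include <stddef.h>

/* Hypothetical stand-in for the cache; not struct page_frag_cache. */
struct frag_cache {
	unsigned char *buf;	/* backing buffer, stands in for the page */
	size_t size;		/* total size of the backing buffer */
	size_t offset;		/* first free byte */
	unsigned int refs;	/* stands in for the page reference count */
};

/* Hypothetical stand-in for the temporary struct page_frag on the stack. */
struct frag_view {
	size_t offset;		/* where the prepared space starts */
	size_t size;		/* bytes available from that offset */
};

/*
 * prepare: report the free space without consuming it. Returns NULL when
 * fewer than min bytes are left (the real code would try to refill here).
 */
static void *frag_prepare(struct frag_cache *nc, size_t min,
			  struct frag_view *v)
{
	if (nc->size - nc->offset < min)
		return NULL;

	v->offset = nc->offset;
	v->size = nc->size - nc->offset;
	return nc->buf + nc->offset;
}

/* commit_noref: consume used bytes that were merged into an existing frag. */
static void frag_commit_noref(struct frag_cache *nc,
			      const struct frag_view *v, size_t used)
{
	if (used > v->size)
		used = v->size;
	nc->offset = v->offset + used;
}

/* commit: consume used bytes and take a reference for a new frag. */
static void frag_commit(struct frag_cache *nc, const struct frag_view *v,
			size_t used)
{
	frag_commit_noref(nc, v, used);
	nc->refs++;
}
```

A caller first calls frag_prepare(), copies at most v.size bytes to the
returned address, and then calls frag_commit() if the data starts a new frag
or frag_commit_noref() if it was coalesced into the previous one. In the
sketch nothing is consumed until the commit, so an error between the two
steps needs no rollback of the offset.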

Is there a better way to reuse the common code as much as possible while
keeping the performance degradation to a minimum?

> 
>> 2. Use the below netcat test case, there seems to be some minor
>>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>>    using the new page_frag API after this patchset.
>>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> 
> This test would barely touch the page pool. The fact is most of the

I am guessing you meant page_frag here?

> overhead for this would likely be things like TCP latency and data
> copy much more than the page allocation. As such fluctuations here are
> likely not related to your changes.

But it does tell us that the replacement does not seem to
cause an obvious regression, right?

I tried using a smaller MTU to amplify the impact of page allocation,
and the result was similar.



* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-09 11:42   ` Yunsheng Lin
@ 2024-12-09 16:03     ` Alexander Duyck
  2024-12-10 12:27       ` Yunsheng Lin
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2024-12-09 16:03 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On Mon, Dec 9, 2024 at 3:42 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/12/9 5:34, Alexander Duyck wrote:
>
> ...
>
> >>
> >> Performance validation for part2:
> >> 1. Using micro-benchmark ko added in patch 1 to test aligned and
> >>    non-aligned API performance impact for the existing users, there
> >>    seems to be about 20% performance degradation for refactoring
> >>    page_frag to support the new API, which seems to nullify most of
> >>    the performance gain in [3] of part1.
> >
> > So if I am understanding correctly then this is showing a 20%
> > performance degradation with this patchset. I would argue that it is
> > significant enough that it would be a blocking factor for this patch
> > set. I would suggest bisecting the patch set to identify where the
> > performance degradation has been added and see what we can do to
> > resolve it, and if nothing else document it in that patch so we can
> > identify the root cause for the slowdown.
>
> The only patch in this patchset affecting the performance of existing API
> seems to be patch 1, only including patch 1 does show ~20% performance
> degradation as including the whole patchset does:
> mm: page_frag: some minor refactoring before adding new API
>
> And the cause seems to be about the binary increasing as below, as the
> performance degradation didn't seems to change much when I tried inlining
> the __page_frag_cache_commit_noref() by moving it to the header file:
>
> ./scripts/bloat-o-meter vmlinux_orig vmlinux
> add/remove: 3/2 grow/shrink: 5/0 up/down: 920/-500 (420)
> Function                                     old     new   delta
> __page_frag_cache_prepare                      -     500    +500
> __napi_alloc_frag_align                       68     180    +112
> __netdev_alloc_skb                           488     596    +108
> napi_alloc_skb                               556     624     +68
> __netdev_alloc_frag_align                    196     252     +56
> svc_tcp_sendmsg                              340     376     +36
> __page_frag_cache_commit_noref                 -      32     +32
> e843419@09a6_0000bd47_30                       -       8      +8
> e843419@0369_000044ee_684                      8       -      -8
> __page_frag_alloc_align                      492       -    -492
> Total: Before=34719207, After=34719627, chg +0.00%
>
> ./scripts/bloat-o-meter page_frag_test_orig.ko page_frag_test.ko
> add/remove: 0/0 grow/shrink: 2/0 up/down: 78/0 (78)
> Function                                     old     new   delta
> page_frag_push_thread                        508     580     +72
> __UNIQUE_ID_vermagic367                       67      73      +6
> Total: Before=4582, After=4660, chg +1.70%

Other than code size, have you tried using perf to profile the
benchmark before and after? I suspect that would be telling about
which code changes are the most likely to be causing the issues.
Overall I don't think the size has increased all that much. I suspect
most of this is the fact that you are inlining more of the
functionality.

> Patch 1 is about refactoring common codes from __page_frag_alloc_va_align()
> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
> new API can make use of them as much as possible.
>
> Any better idea to reuse common codes as much as possible while avoiding
> the performance degradation as much as possible?
>
> >
> >> 2. Use the below netcat test case, there seems to be some minor
> >>    performance gain for replacing 'page_frag' with 'page_frag_cache'
> >>    using the new page_frag API after this patchset.
> >>    server: taskset -c 32 nc -l -k 1234 > /dev/null
> >>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
> >
> > This test would barely touch the page pool. The fact is most of the
>
> I am guessing you meant page_frag here?
>
> > overhead for this would likely be things like TCP latency and data
> > copy much more than the page allocation. As such fluctuations here are
> > likely not related to your changes.
>
> But it does tell us something that the replacing does not seems to
> cause obvious regression, right?

Not really. The fragment allocator is such a small portion of this
test that we could probably double the cost for it and it would still
be negligible.

> I tried using a smaller MTU to amplify the impact of page allocation,
> it seemed to have a similar result.

Not surprising. However the network is likely only a small part of
this. I suspect if you ran a profile it would likely show the same.



* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-09 16:03     ` Alexander Duyck
@ 2024-12-10 12:27       ` Yunsheng Lin
  2024-12-10 15:58         ` Alexander Duyck
  0 siblings, 1 reply; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-10 12:27 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On 2024/12/10 0:03, Alexander Duyck wrote:

...

> 
> Other than code size have you tried using perf to profile the
> benchmark before and after. I suspect that would be telling about
> which code changes are the most likely to be causing the issues.
> Overall I don't think the size has increased all that much. I suspect
> most of this is the fact that you are inlining more of the
> functionality.

The test result seems very sensitive to code changes and
reorganization. Using the patch at the end of this mail, which works
around 'perf stat' not including data from the kernel thread, seems to
provide more reasonable performance data.

The most obvious difference is in 'insn per cycle', and I am not yet
sure how to interpret the data below with respect to the performance
degradation.

With patch 1:
 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

       5473.815250      task-clock (msec)         #    0.984 CPUs utilized
                18      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.022 K/sec
       14210894727      cycles                    #    2.596 GHz                      (92.78%)
       18903171767      instructions              #    1.33  insn per cycle           (92.82%)
        2997494420      branches                  #  547.606 M/sec                    (92.84%)
           7539978      branch-misses             #    0.25% of all branches          (92.84%)
        6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
          29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
          57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
            347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
        5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
            193877      L1-icache-load-misses                                         (92.91%)
        6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
            137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
        5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
               210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.563068950 seconds time elapsed

Without patch 1:
root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':

       5306.644600      task-clock (msec)         #    0.984 CPUs utilized
                15      context-switches          #    0.003 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.023 K/sec
       13776872322      cycles                    #    2.596 GHz                      (92.84%)
       13257649773      instructions              #    0.96  insn per cycle           (92.82%)
        2446901087      branches                  #  461.101 M/sec                    (92.91%)
           7172751      branch-misses             #    0.29% of all branches          (92.84%)
        5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
          38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
          65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
            191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
        4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
            175208      L1-icache-load-misses                                         (92.91%)
        5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
            140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
        4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
               170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       5.395104330 seconds time elapsed
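
As a reading aid for the two runs above, the following is a minimal
sketch that recomputes 'insn per cycle' and the relative change between
the runs from the posted counters. The struct and function names are
hypothetical and exist only for this illustration; the numbers are
copied from the 'cycles', 'instructions' and 'seconds time elapsed'
lines above. By this arithmetic the run with patch 1 retires roughly
43% more instructions while cycles and elapsed time grow by only about
3%, so the higher 'insn per cycle' by itself does not indicate a faster
run. This is plain arithmetic over the posted data, not a diagnosis of
the cause.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the three counters compared here. */
struct perf_counters {
	uint64_t cycles;
	uint64_t instructions;
	double elapsed_sec;
};

/* Values copied from the 'perf stat' output posted above. */
static const struct perf_counters with_patch1 = {
	14210894727ULL, 18903171767ULL, 5.563068950
};
static const struct perf_counters without_patch1 = {
	13776872322ULL, 13257649773ULL, 5.395104330
};

/* Instructions retired per cycle, the ratio perf prints as 'insn per cycle'. */
static double insn_per_cycle(const struct perf_counters *c)
{
	return (double)c->instructions / (double)c->cycles;
}

/* Relative change from 'before' to 'after', in percent. */
static double pct_change(double before, double after)
{
	return (after - before) / before * 100.0;
}

/* Print the comparison between the two runs. */
static void print_comparison(void)
{
	printf("IPC: %.2f -> %.2f\n",
	       insn_per_cycle(&without_patch1), insn_per_cycle(&with_patch1));
	printf("instructions: %+.1f%%\n",
	       pct_change((double)without_patch1.instructions,
			  (double)with_patch1.instructions));
	printf("cycles: %+.1f%%\n",
	       pct_change((double)without_patch1.cycles,
			  (double)with_patch1.cycles));
	printf("elapsed: %+.1f%%\n",
	       pct_change(without_patch1.elapsed_sec, with_patch1.elapsed_sec));
}
```

Calling print_comparison() from a small main() prints the four lines.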


Below is perf data for the aligned API without patch 1. The non-aligned
run above also uses test_alloc_len=12, so in theory the aligned API
should not perform better than the non-aligned API, as the aligned API
additionally aligns fragsz based on SMP_CACHE_BYTES. The testing seems
to show otherwise, and I am not sure how to interpret that either:
perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':

       2447.553100      task-clock (msec)         #    0.965 CPUs utilized
                 9      context-switches          #    0.004 K/sec
                 1      cpu-migrations            #    0.000 K/sec
               122      page-faults               #    0.050 K/sec
        6354149177      cycles                    #    2.596 GHz                      (92.81%)
        6467793726      instructions              #    1.02  insn per cycle           (92.76%)
        1120749183      branches                  #  457.906 M/sec                    (92.81%)
           7370402      branch-misses             #    0.66% of all branches          (92.81%)
        2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
          39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
          42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
             95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
        2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
            118902      L1-icache-load-misses                                         (92.97%)
        3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
             81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
        2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
               159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
   <not supported>      L1-dcache-prefetches
   <not supported>      L1-dcache-prefetch-misses

       2.535085780 seconds time elapsed


> 
>> Patch 1 is about refactoring common codes from __page_frag_alloc_va_align()
>> to __page_frag_cache_prepare() and __page_frag_cache_commit(), so that the
>> new API can make use of them as much as possible.
>>
>> Any better idea to reuse common codes as much as possible while avoiding
>> the performance degradation as much as possible?
>>
>>>
>>>> 2. Use the below netcat test case, there seems to be some minor
>>>>    performance gain for replacing 'page_frag' with 'page_frag_cache'
>>>>    using the new page_frag API after this patchset.
>>>>    server: taskset -c 32 nc -l -k 1234 > /dev/null
>>>>    client: perf stat -r 200 -- taskset -c 0 head -c 20G /dev/zero | taskset -c 1 nc 127.0.0.1 1234
>>>
>>> This test would barely touch the page pool. The fact is most of the
>>
>> I am guessing you meant page_frag here?
>>
>>> overhead for this would likely be things like TCP latency and data
>>> copy much more than the page allocation. As such fluctuations here are
>>> likely not related to your changes.
>>
>> But it does tell us something that the replacing does not seems to
>> cause obvious regression, right?
> 
> Not really. The fragment allocator is such a small portion of this
> test that we could probably double the cost for it and it would still
> be negligible.

The main benefit of replacing the old API seems to be the batching of
page->_refcount updates and the avoidance of some page_address()
calls, but there may be some overhead from unifying the page_frag API.
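
To illustrate what batching of refcount updates means on the
allocation side, here is a simplified userspace model, not the kernel
code. Every name carrying a model_ prefix or MODEL_ prefix is
hypothetical, and C11 atomics stand in for page->_refcount. The
per-fragment path performs one atomic increment for each fragment
handed out, as get_page() does for the old 'page_frag' user. The cached
path takes a large batch of references once, accounts for each fragment
in a plain cache-local counter, and returns the unused references with
a single atomic subtraction.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_BIAS_MAX 32768u	/* hypothetical batch size for this model */

struct model_page {
	atomic_uint refcount;
	unsigned int atomic_ops;	/* atomic RMW operations issued so far */
};

struct model_frag_cache {
	struct model_page *page;
	unsigned int pagecnt_bias;	/* references still owned by the cache */
};

static void model_page_init(struct model_page *p)
{
	atomic_init(&p->refcount, 1);	/* the allocator's initial reference */
	p->atomic_ops = 0;
}

/* Per-fragment style: one atomic increment for every fragment handed out. */
static void model_get_page(struct model_page *p)
{
	atomic_fetch_add(&p->refcount, 1);
	p->atomic_ops++;
}

/* Cached style: take a large batch of references with one atomic op. */
static void model_cache_attach(struct model_frag_cache *nc,
			       struct model_page *p)
{
	atomic_fetch_add(&p->refcount, MODEL_BIAS_MAX);
	p->atomic_ops++;
	nc->page = p;
	nc->pagecnt_bias = MODEL_BIAS_MAX + 1;
}

/* Handing out a fragment only touches the cache-local bias. */
static void model_cache_alloc(struct model_frag_cache *nc)
{
	nc->pagecnt_bias--;
}

/* Return unused references in one atomic op; true if the page is now free. */
static bool model_cache_detach(struct model_frag_cache *nc)
{
	unsigned int old = atomic_fetch_sub(&nc->page->refcount,
					    nc->pagecnt_bias);

	nc->page->atomic_ops++;
	return old == nc->pagecnt_bias;
}
```

In this model, handing out 100 fragments costs 100 atomic operations on
the per-fragment path and 2 on the cached path. The freeing side is not
modelled: each fragment still drops its own reference there.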

> 
>> I tried using a smaller MTU to amplify the impact of page allocation,
>> it seemed to have a similar result.
> 
> Not surprising. However the network is likely only a small part of
> this. I suspect if you ran a profile it would likely show the same.
> 

Patch for doing the push operation in the insmod process instead of
in the kernel thread, as 'perf stat' does not seem to include the data
of the kernel thread:
diff --git a/tools/testing/selftests/mm/page_frag/page_frag_test.c b/tools/testing/selftests/mm/page_frag/page_frag_test.c
index e806c1866e36..a818431c38b8 100644
--- a/tools/testing/selftests/mm/page_frag/page_frag_test.c
+++ b/tools/testing/selftests/mm/page_frag/page_frag_test.c
@@ -131,30 +131,39 @@ static int __init page_frag_test_init(void)
        init_completion(&wait);

        if (test_alloc_len > PAGE_SIZE || test_alloc_len <= 0 ||
-           !cpu_active(test_push_cpu) || !cpu_active(test_pop_cpu))
+           !cpu_active(test_pop_cpu))
                return -EINVAL;

        ret = ptr_ring_init(&ptr_ring, nr_objs, GFP_KERNEL);
        if (ret)
                return ret;

-       tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
-                                        test_push_cpu, "page_frag_push");
-       if (IS_ERR(tsk_push))
-               return PTR_ERR(tsk_push);
-
        tsk_pop = kthread_create_on_cpu(page_frag_pop_thread, &ptr_ring,
                                        test_pop_cpu, "page_frag_pop");
-       if (IS_ERR(tsk_pop)) {
-               kthread_stop(tsk_push);
+       if (IS_ERR(tsk_pop))
                return PTR_ERR(tsk_pop);
+
+       pr_info("test_push_cpu = %d\n", test_push_cpu);
+
+       if (test_push_cpu < 0)
+               goto skip_push_thread;
+
+       tsk_push = kthread_create_on_cpu(page_frag_push_thread, &ptr_ring,
+                                        test_push_cpu, "page_frag_push");
+       if (IS_ERR(tsk_push)) {
+               kthread_stop(tsk_pop);
+               return PTR_ERR(tsk_push);
        }

+skip_push_thread:
        start = ktime_get();
-       wake_up_process(tsk_push);
+       pr_info("waiting for test to complete\n");
        wake_up_process(tsk_pop);

-       pr_info("waiting for test to complete\n");
+       if (test_push_cpu < 0)
+               page_frag_push_thread(&ptr_ring);
+       else
+               wake_up_process(tsk_push);

        while (!wait_for_completion_timeout(&wait, msecs_to_jiffies(10000))) {
                /* exit if there is no progress for push or pop size */



* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-10 12:27       ` Yunsheng Lin
@ 2024-12-10 15:58         ` Alexander Duyck
  2024-12-11 12:52           ` Yunsheng Lin
  0 siblings, 1 reply; 18+ messages in thread
From: Alexander Duyck @ 2024-12-10 15:58 UTC (permalink / raw)
  To: Yunsheng Lin
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On Tue, Dec 10, 2024 at 4:27 AM Yunsheng Lin <linyunsheng@huawei.com> wrote:
>
> On 2024/12/10 0:03, Alexander Duyck wrote:
>
> ...
>
> >
> > Other than code size have you tried using perf to profile the
> > benchmark before and after. I suspect that would be telling about
> > which code changes are the most likely to be causing the issues.
> > Overall I don't think the size has increased all that much. I suspect
> > most of this is the fact that you are inlining more of the
> > functionality.
>
> It seems the testing result is very sensitive to code changing and
> reorganizing, as using the patch at the end to avoid the problem of
> 'perf stat' not including data from the kernel thread seems to provide
> more reasonable performance data.
>
> It seems the most obvious difference is 'insn per cycle' and I am not
> sure how to interpret the difference of below data for the performance
> degradation yet.
>
> With patch 1:
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
>        5473.815250      task-clock (msec)         #    0.984 CPUs utilized
>                 18      context-switches          #    0.003 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.022 K/sec
>        14210894727      cycles                    #    2.596 GHz                      (92.78%)
>        18903171767      instructions              #    1.33  insn per cycle           (92.82%)
>         2997494420      branches                  #  547.606 M/sec                    (92.84%)
>            7539978      branch-misses             #    0.25% of all branches          (92.84%)
>         6291190031      L1-dcache-loads           # 1149.325 M/sec                    (92.78%)
>           29874701      L1-dcache-load-misses     #    0.47% of all L1-dcache hits    (92.82%)
>           57979668      LLC-loads                 #   10.592 M/sec                    (92.79%)
>             347822      LLC-load-misses           #    0.01% of all LL-cache hits     (92.90%)
>         5946042629      L1-icache-loads           # 1086.270 M/sec                    (92.91%)
>             193877      L1-icache-load-misses                                         (92.91%)
>         6820220221      dTLB-loads                # 1245.972 M/sec                    (92.91%)
>             137999      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
>         5947607438      iTLB-loads                # 1086.556 M/sec                    (92.91%)
>                210      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
>
>        5.563068950 seconds time elapsed
>
> Without patch 1:
> root@(none):/home# perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000':
>
>        5306.644600      task-clock (msec)         #    0.984 CPUs utilized
>                 15      context-switches          #    0.003 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.023 K/sec
>        13776872322      cycles                    #    2.596 GHz                      (92.84%)
>        13257649773      instructions              #    0.96  insn per cycle           (92.82%)
>         2446901087      branches                  #  461.101 M/sec                    (92.91%)
>            7172751      branch-misses             #    0.29% of all branches          (92.84%)
>         5041456343      L1-dcache-loads           #  950.027 M/sec                    (92.84%)
>           38418414      L1-dcache-load-misses     #    0.76% of all L1-dcache hits    (92.76%)
>           65486400      LLC-loads                 #   12.340 M/sec                    (92.82%)
>             191497      LLC-load-misses           #    0.01% of all LL-cache hits     (92.79%)
>         4906456833      L1-icache-loads           #  924.587 M/sec                    (92.90%)
>             175208      L1-icache-load-misses                                         (92.91%)
>         5539879607      dTLB-loads                # 1043.952 M/sec                    (92.91%)
>             140166      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.91%)
>         4906685698      iTLB-loads                #  924.631 M/sec                    (92.91%)
>                170      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.66%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
>
>        5.395104330 seconds time elapsed
>
>
> Below is perf data for aligned API without patch 1, as above non-aligned
> API also use test_alloc_len as 12, theoretically the performance data
> should not be better than the non-aligned API as the aligned API will do
> the aligning of fragsz basing on SMP_CACHE_BYTES, but the testing seems
> to show otherwise and I am not sure how to interpret that too:
> perf stat -d -d -d taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1
> insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable
>
>  Performance counter stats for 'taskset -c 0 insmod ./page_frag_test.ko test_push_cpu=-1 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1':
>
>        2447.553100      task-clock (msec)         #    0.965 CPUs utilized
>                  9      context-switches          #    0.004 K/sec
>                  1      cpu-migrations            #    0.000 K/sec
>                122      page-faults               #    0.050 K/sec
>         6354149177      cycles                    #    2.596 GHz                      (92.81%)
>         6467793726      instructions              #    1.02  insn per cycle           (92.76%)
>         1120749183      branches                  #  457.906 M/sec                    (92.81%)
>            7370402      branch-misses             #    0.66% of all branches          (92.81%)
>         2847963759      L1-dcache-loads           # 1163.596 M/sec                    (92.76%)
>           39439592      L1-dcache-load-misses     #    1.38% of all L1-dcache hits    (92.77%)
>           42553468      LLC-loads                 #   17.386 M/sec                    (92.71%)
>              95960      LLC-load-misses           #    0.01% of all LL-cache hits     (92.94%)
>         2554887203      L1-icache-loads           # 1043.854 M/sec                    (92.97%)
>             118902      L1-icache-load-misses                                         (92.97%)
>         3365755289      dTLB-loads                # 1375.151 M/sec                    (92.97%)
>              81401      dTLB-load-misses          #    0.00% of all dTLB cache hits   (92.97%)
>         2554882937      iTLB-loads                # 1043.852 M/sec                    (92.97%)
>                159      iTLB-load-misses          #    0.00% of all iTLB cache hits   (85.58%)
>    <not supported>      L1-dcache-prefetches
>    <not supported>      L1-dcache-prefetch-misses
>
>        2.535085780 seconds time elapsed

I'm not sure perf stat will tell us much as it is really too high
level to give us much in the way of details. I would be more
interested in the output from perf record -g followed by a perf
report, or maybe even just a snapshot from perf top while the test is
running. That should show us where the CPU is spending most of its
time and what areas are hot in the before and after graphs.



* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-10 15:58         ` Alexander Duyck
@ 2024-12-11 12:52           ` Yunsheng Lin
  2024-12-13 12:09             ` Yunsheng Lin
  0 siblings, 1 reply; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-11 12:52 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On 2024/12/10 23:58, Alexander Duyck wrote:

> 
> I'm not sure perf stat will tell us much as it is really too high
> level to give us much in the way of details. I would be more
> interested in the output from perf record -g followed by a perf
> report, or maybe even just a snapshot from perf top while the test is
> running. That should show us where the CPU is spending most of its
> time and what areas are hot in the before and after graphs.

Using 'perf top', the bottleneck seems to be on the freeing side:
page_frag_free() took up to about 50% of the push CPU for the
non-aligned API and 16% for the aligned API.

With the patch below, page_frag_free() disappears from the 'perf top'
output for the push CPU. The new performance data is below:
Without patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

         21.084113      task-clock (msec)         #    0.008 CPUs utilized            ( +-  1.59% )
                 7      context-switches          #    0.334 K/sec                    ( +-  1.25% )
                 1      cpu-migrations            #    0.031 K/sec                    ( +- 20.20% )
                78      page-faults               #    0.004 M/sec                    ( +-  0.26% )
          54748233      cycles                    #    2.597 GHz                      ( +-  1.59% )
          61637051      instructions              #    1.13  insn per cycle           ( +-  0.13% )
          14727268      branches                  #  698.501 M/sec                    ( +-  0.11% )
             20178      branch-misses             #    0.14% of all branches          ( +-  0.94% )

       2.637345524 seconds time elapsed                                          ( +-  0.19% )

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

         19.669259      task-clock (msec)         #    0.009 CPUs utilized            ( +-  2.91% )
                 7      context-switches          #    0.356 K/sec                    ( +-  1.04% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +- 68.82% )
                77      page-faults               #    0.004 M/sec                    ( +-  0.27% )
          51077447      cycles                    #    2.597 GHz                      ( +-  2.91% )
          58875368      instructions              #    1.15  insn per cycle           ( +-  4.47% )
          14040015      branches                  #  713.805 M/sec                    ( +-  4.68% )
             20150      branch-misses             #    0.14% of all branches          ( +-  0.64% )

       2.226539190 seconds time elapsed                                          ( +-  0.12% )

With patch 1:
 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000' (20 runs):

         20.782788      task-clock (msec)         #    0.008 CPUs utilized            ( +-  0.09% )
                 7      context-switches          #    0.342 K/sec                    ( +-  0.97% )
                 1      cpu-migrations            #    0.031 K/sec                    ( +- 16.83% )
                78      page-faults               #    0.004 M/sec                    ( +-  0.31% )
          53967333      cycles                    #    2.597 GHz                      ( +-  0.08% )
          61577257      instructions              #    1.14  insn per cycle           ( +-  0.02% )
          14712140      branches                  #  707.900 M/sec                    ( +-  0.02% )
             20234      branch-misses             #    0.14% of all branches          ( +-  0.55% )

       2.677974457 seconds time elapsed                                          ( +-  0.15% )

root@(none):/home# perf stat -r 20 insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1

insmod: can't insert './page_frag_test.ko': Resource temporarily unavailable

 Performance counter stats for 'insmod ./page_frag_test.ko test_push_cpu=0 test_pop_cpu=1 test_alloc_len=12 nr_test=51200000 test_align=1' (20 runs):

         20.420537      task-clock (msec)         #    0.009 CPUs utilized            ( +-  0.05% )
                 7      context-switches          #    0.345 K/sec                    ( +-  0.71% )
                 0      cpu-migrations            #    0.005 K/sec                    ( +-100.00% )
                77      page-faults               #    0.004 M/sec                    ( +-  0.23% )
          53038942      cycles                    #    2.597 GHz                      ( +-  0.05% )
          59965712      instructions              #    1.13  insn per cycle           ( +-  0.03% )
          14372507      branches                  #  703.826 M/sec                    ( +-  0.03% )
             20580      branch-misses             #    0.14% of all branches          ( +-  0.56% )

       2.287783171 seconds time elapsed                                          ( +-  0.12% )

It seems the bottleneck is still the freeing side, so the above
result might not be as meaningful as it should be.

As we can't use more than one CPU for the freeing side without some
lock when using a single ptr_ring, it seems something more complicated
might need to be done in order to support more than one CPU for the
freeing side?
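
The single-consumer constraint can be sketched with a small userspace
model of a pointer ring. This is an illustration under simplifying
assumptions, not the kernel's ptr_ring: the model_ names and the ring
size are hypothetical, the code is single-threaded, and it omits the
memory barriers a real producer/consumer pair needs. A NULL slot means
"empty", the producer fails with -ENOSPC when its next slot is still
occupied, and the consumer clears a slot after reading it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_RING_SIZE 4	/* hypothetical size, for illustration only */

struct model_ring {
	void *queue[MODEL_RING_SIZE];	/* a NULL slot means "empty" */
	int producer;			/* index used only by the producer */
	int consumer;			/* index used only by the consumer */
};

/* Store ptr in the next slot, or fail if that slot is still occupied. */
static int model_ring_produce(struct model_ring *r, void *ptr)
{
	if (r->queue[r->producer])
		return -ENOSPC;

	r->queue[r->producer] = ptr;
	r->producer = (r->producer + 1) % MODEL_RING_SIZE;
	return 0;
}

/* Take the oldest entry, or return NULL if the ring is empty. */
static void *model_ring_consume(struct model_ring *r)
{
	void *ptr = r->queue[r->consumer];

	if (ptr) {
		/*
		 * Read, clear and advance are three separate steps. Two
		 * consumers running this concurrently could both read the
		 * same slot before either clears it, which is why a second
		 * consumer would need a lock around this function.
		 */
		r->queue[r->consumer] = NULL;
		r->consumer = (r->consumer + 1) % MODEL_RING_SIZE;
	}

	return ptr;
}
```

With one producer and one consumer each side owns its index, so no lock
is needed in this model. Adding a second freeing CPU means either
serializing model_ring_consume() or giving each consumer its own ring.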

Before patch 1, __page_frag_alloc_align() took up to 3.62% of
CPU using 'perf top'.
After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
took up to 4.67% + 1.01% = 5.68%.
The overall results are similar, so I am not sure whether the CPU usage
can tell us about the performance degradation here, as the difference
seems to be quite large?

@@ -100,13 +100,20 @@ static int page_frag_push_thread(void *arg)
                if (!va)
                        continue;

-               ret = __ptr_ring_produce(ring, va);
-               if (ret) {
+               do {
+                       ret = __ptr_ring_produce(ring, va);
+                       if (!ret) {
+                               va = NULL;
+                               break;
+                       } else {
+                               cond_resched();
+                       }
+               } while (!force_exit);
+
+               if (va)
                        page_frag_free(va);
-                       cond_resched();
-               } else {
+               else
                        test_pushed++;
-               }
        }

        pr_info("page_frag push test thread exits on cpu %d\n",




* Re: [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2)
  2024-12-11 12:52           ` Yunsheng Lin
@ 2024-12-13 12:09             ` Yunsheng Lin
  0 siblings, 0 replies; 18+ messages in thread
From: Yunsheng Lin @ 2024-12-13 12:09 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: davem, kuba, pabeni, netdev, linux-kernel, Shuah Khan,
	Andrew Morton, Linux-MM

On 2024/12/11 20:52, Yunsheng Lin wrote:
> It seems that bottleneck is still the freeing side that the above
> result might not be as meaningful as it should be.

Through 'perf top' annotation, there seems to be 70%+ CPU usage
for the atomic operation of put_page_testzero() in page_frag_free().
It was unexpected that the atomic operation had that much overhead :(

> 
> As we can't use more than one cpu for the free side without some
> lock using a single ptr_ring, it seems something more complicated
> might need to be done in order to support more than one CPU for the
> freeing side?
> 
> Before patch 1, __page_frag_alloc_align took up to 3.62% percent of
> CPU using 'perf top'.
> After patch 1, __page_frag_cache_prepare() and __page_frag_cache_commit_noref()
> took up to 4.67% + 1.01% = 5.68%.
> Having a similar result, I am not sure if the CPU usages is able tell us
> the performance degradation here as it seems to be quite large?
> 

Also, using 'struct page_frag' to pass the parameter seems to cause some
observable overhead, as the testing is very low level. The performance
difference seems to be negligible with the patch below, which avoids
passing 'struct page_frag': the CPU usage is 3.62% for
__page_frag_alloc_align() before patch 1 and 3.27% for
__page_frag_cache_prepare() after patch 1.

The new refactoring avoids some overhead for the old API, but might cause
some overhead for the new API, as it is not able to skip the virt_to_page()
for the refilling and reusing case, though that seems to be an unlikely case.
Or is there any better idea for how to do the refactoring to unify the
page_frag API?

diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 41a91df82631..b83e7655654e 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -39,8 +39,24 @@ static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)

 void page_frag_cache_drain(struct page_frag_cache *nc);
 void __page_frag_cache_drain(struct page *page, unsigned int count);
-void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
-			      gfp_t gfp_mask, unsigned int align_mask);
+void *__page_frag_cache_prepare(struct page_frag_cache *nc, unsigned int fragsz,
+				gfp_t gfp_mask, unsigned int align_mask);
+
+static inline void *__page_frag_alloc_align(struct page_frag_cache *nc,
+					    unsigned int fragsz, gfp_t gfp_mask,
+					    unsigned int align_mask)
+{
+	void *va;
+
+	va = __page_frag_cache_prepare(nc, fragsz, gfp_mask, align_mask);
+	if (likely(va)) {
+		va += nc->offset;
+		nc->offset += fragsz;
+		nc->pagecnt_bias--;
+	}
+
+	return va;
+}

 static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
 					  unsigned int fragsz, gfp_t gfp_mask,
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index 3f7a203d35c6..729309aee27a 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -90,9 +90,9 @@ void __page_frag_cache_drain(struct page *page, unsigned int count)
 }
 EXPORT_SYMBOL(__page_frag_cache_drain);

-void *__page_frag_alloc_align(struct page_frag_cache *nc,
-			      unsigned int fragsz, gfp_t gfp_mask,
-			      unsigned int align_mask)
+void *__page_frag_cache_prepare(struct page_frag_cache *nc,
+				unsigned int fragsz, gfp_t gfp_mask,
+				unsigned int align_mask)
 {
 	unsigned long encoded_page = nc->encoded_page;
 	unsigned int size, offset;
@@ -151,12 +151,10 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
 		offset = 0;
 	}

-	nc->pagecnt_bias--;
-	nc->offset = offset + fragsz;
-
-	return encoded_page_decode_virt(encoded_page) + offset;
+	nc->offset = offset;
+	return encoded_page_decode_virt(encoded_page);
 }
-EXPORT_SYMBOL(__page_frag_alloc_align);
+EXPORT_SYMBOL(__page_frag_cache_prepare);

 /*
  * Frees a page fragment allocated out of either a compound or order 0 page.



Thread overview: 18+ messages
2024-12-06 12:25 [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 01/10] mm: page_frag: some minor refactoring before adding new API Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 02/10] net: rename skb_copy_to_page_nocache() helper Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 03/10] mm: page_frag: update documentation for page_frag Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 04/10] mm: page_frag: introduce page_frag_alloc_abort() related API Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 05/10] mm: page_frag: introduce refill prepare & commit API Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 06/10] mm: page_frag: introduce alloc_refill " Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 07/10] mm: page_frag: introduce probe related API Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 08/10] mm: page_frag: add testing for the newly added API Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 09/10] net: replace page_frag with page_frag_cache Yunsheng Lin
2024-12-06 12:25 ` [PATCH net-next v2 10/10] mm: page_frag: add an entry in MAINTAINERS for page_frag Yunsheng Lin
2024-12-08 21:34 ` [PATCH net-next v2 00/10] Replace page_frag with page_frag_cache (Part-2) Alexander Duyck
2024-12-09 11:42   ` Yunsheng Lin
2024-12-09 16:03     ` Alexander Duyck
2024-12-10 12:27       ` Yunsheng Lin
2024-12-10 15:58         ` Alexander Duyck
2024-12-11 12:52           ` Yunsheng Lin
2024-12-13 12:09             ` Yunsheng Lin
