* [PATCH net-next 0/3] net/smc: buffer allocation and registration improvements
From: D. Wythe @ 2026-01-23 8:23 UTC
To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

This series improves SMC-R buffer management by refining the allocation
logic and reducing hardware resource overhead during registration.

The primary improvement is a significant reduction in MTTE (Memory
Translation Table Entry) consumption.
By aligning IB registration with the actual physical block sizes, we can
reduce the entry count from one per 4KB page to just one per contiguous
block. This is especially beneficial for large buffers, preventing
hardware resource exhaustion on RDMA NICs.
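
To put concrete numbers on this (assuming the common 4KB PAGE_SIZE): a
2MB buffer registered at page granularity costs 2MB / 4KB = 512 MTT
entries, while the same buffer registered as one physically contiguous
block costs a single entry, a 512x reduction per buffer.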

D. Wythe (3):
  net/smc: cap allocation order for SMC-R physically contiguous buffers
  mm: vmalloc: export find_vm_area()
  net/smc: optimize MTTE consumption for SMC-R buffers

 mm/vmalloc.c       |  1 +
 net/smc/smc_core.c | 31 ++++++++++++++++++-------------
 net/smc/smc_ib.c   | 23 ++++++++++++++++++++---
 3 files changed, 39 insertions(+), 16 deletions(-)

--
2.45.0
* [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers

From: D. Wythe @ 2026-01-23 8:23 UTC
To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

alloc_pages() cannot satisfy requests exceeding MAX_PAGE_ORDER, and
attempting such allocations leads to guaranteed failures and potential
kernel warnings.

For SMCR_PHYS_CONT_BUFS, the allocation order is now capped to
MAX_PAGE_ORDER, ensures the attempts to allocate the largest possible
physically contiguous chunk instead of failing with an invalid order,
which also avoid redundant "try-fail-degrade" cycles in __smc_buf_create().

For SMCR_MIXED_BUFS, If it's order exceeds MAX_PAGE_ORDER, skips the
doomed physical allocation attempt and fallback to virtual memory
immediately.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc_core.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index e4eabc83719e..6219db498976 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2324,26 +2324,30 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
 	if (!buf_desc)
 		return ERR_PTR(-ENOMEM);
 
+	buf_desc->order = get_order(bufsize);
+
 	switch (lgr->buf_type) {
 	case SMCR_PHYS_CONT_BUFS:
+		buf_desc->order = min(buf_desc->order, MAX_PAGE_ORDER);
+		fallthrough;
 	case SMCR_MIXED_BUFS:
-		buf_desc->order = get_order(bufsize);
-		buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
-					      __GFP_NOMEMALLOC | __GFP_COMP |
-					      __GFP_NORETRY | __GFP_ZERO,
-					      buf_desc->order);
-		if (buf_desc->pages) {
-			buf_desc->cpu_addr =
-				(void *)page_address(buf_desc->pages);
-			buf_desc->len = bufsize;
-			buf_desc->is_vm = false;
-			break;
+		if (buf_desc->order <= MAX_PAGE_ORDER) {
+			buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+						      __GFP_NOMEMALLOC | __GFP_COMP |
+						      __GFP_NORETRY | __GFP_ZERO,
+						      buf_desc->order);
+			if (buf_desc->pages) {
+				buf_desc->cpu_addr =
+					(void *)page_address(buf_desc->pages);
+				buf_desc->len = bufsize;
+				buf_desc->is_vm = false;
+				break;
+			}
 		}
 		if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
 			goto out;
 		fallthrough;	// try virtually contiguous buf
 	case SMCR_VIRT_CONT_BUFS:
-		buf_desc->order = get_order(bufsize);
 		buf_desc->cpu_addr = vzalloc(PAGE_SIZE << buf_desc->order);
 		if (!buf_desc->cpu_addr)
 			goto out;
--
2.45.0
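
A worked example of the new logic, assuming a 4KB PAGE_SIZE and the
default MAX_PAGE_ORDER of 10, i.e. a 4MB maximum allocation (both values
are architecture and config dependent): for an 8MB buffer,
get_order(bufsize) returns 11. With SMCR_PHYS_CONT_BUFS the order is
clamped to 10, so alloc_pages() is attempted once at the largest valid
order instead of failing outright; with SMCR_MIXED_BUFS the uncapped
order 11 fails the "<= MAX_PAGE_ORDER" check, so the doomed alloc_pages()
call is skipped and the code falls through directly to vzalloc().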
* Re: [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers

From: Alexandra Winter @ 2026-01-23 10:54 UTC
To: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On 23.01.26 09:23, D. Wythe wrote:

Describe your changes in imperative mood.
https://docs.kernel.org/process/submitting-patches.html#describe-your-changes

> For SMCR_PHYS_CONT_BUFS, the allocation order is now capped to
> MAX_PAGE_ORDER, ensures the attempts to allocate the largest possible
> physically contiguous chunk instead of failing with an invalid order,
> which also avoid redundant "try-fail-degrade" cycles in __smc_buf_create().
>
> For SMCR_MIXED_BUFS, If it's order exceeds MAX_PAGE_ORDER, skips the
> doomed physical allocation attempt and fallback to virtual memory
> immediately.
>

Proposal for a version in imperative mood (iiuc):
"
For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER. This
ensures the attempts to allocate the largest possible physically contiguous
chunk succeed, instead of failing with an invalid order. This also avoids
redundant "try-fail-degrade" cycles in __smc_buf_create().

For SMCR_MIXED_BUFS, if its order exceeds MAX_PAGE_ORDER, skip the doomed
physical allocation attempt and fallback to virtual memory immediately.
"

> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Other than that: LGTM
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
* Re: [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers

From: D. Wythe @ 2026-01-24 9:22 UTC
To: Alexandra Winter
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 11:54:37AM +0100, Alexandra Winter wrote:
> On 23.01.26 09:23, D. Wythe wrote:
> Describe your changes in imperative mood.
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
>
> [...]
>
> Proposal for a version in imperative mood (iiuc):
> "
> For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER. This
> ensures the attempts to allocate the largest possible physically contiguous
> chunk succeed, instead of failing with an invalid order. This also avoids
> redundant "try-fail-degrade" cycles in __smc_buf_create().
>
> For SMCR_MIXED_BUFS, if its order exceeds MAX_PAGE_ORDER, skip the doomed
> physical allocation attempt and fallback to virtual memory immediately.
> "
>
> Other than that: LGTM
> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>

Hi Alexandra,

Thank you for your review and for providing the refined description.
I will use your suggested wording for the commit message in V2.

Best regards,
D. Wythe
* [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-23 8:23 UTC
To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

find_vm_area() provides a way to find the vm_struct associated with a
virtual address. Export this symbol to modules so that modularized
subsystems can perform lookups on vmalloc addresses.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 mm/vmalloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..3eb9fe761c34 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
 
 	return va->vm;
 }
+EXPORT_SYMBOL_GPL(find_vm_area);
 
 /**
  * remove_vm_area - find and remove a continuous kernel virtual area
--
2.45.0
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Christoph Hellwig @ 2026-01-23 14:44 UTC
To: D. Wythe
Cc: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> find_vm_area() provides a way to find the vm_struct associated with a
> virtual address. Export this symbol to modules so that modularized
> subsystems can perform lookups on vmalloc addresses.

No, they have absolutely no business doing that. This functionality is
very intentionally kept private.
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-23 18:55 UTC
To: D. Wythe
Cc: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> find_vm_area() provides a way to find the vm_struct associated with a
> virtual address. Export this symbol to modules so that modularized
> subsystems can perform lookups on vmalloc addresses.
>
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..3eb9fe761c34 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
>
>  	return va->vm;
>  }
> +EXPORT_SYMBOL_GPL(find_vm_area);
>
This is internal. We can not just export it.

--
Uladzislau Rezki
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-24 9:35 UTC
To: Uladzislau Rezki
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
    Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > find_vm_area() provides a way to find the vm_struct associated with a
> > virtual address. Export this symbol to modules so that modularized
> > subsystems can perform lookups on vmalloc addresses.
> >
> > [...]
> > +EXPORT_SYMBOL_GPL(find_vm_area);
> >
> This is internal. We can not just export it.
>
> --
> Uladzislau Rezki

Hi Uladzislau,

Thank you for the feedback. I agree that we should avoid exposing
internal implementation details like struct vm_struct to external
subsystems.

Following Christoph's suggestion, I'm planning to encapsulate the page
order lookup into a minimal helper instead:

unsigned int vmalloc_page_order(const void *addr)
{
	struct vm_struct *vm;

	vm = find_vm_area(addr);
	return vm ? vm->page_order : 0;
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);

Does this approach look reasonable to you? It would keep the vm_struct
layout private while satisfying the optimization needs of SMC.

Thanks,
D. Wythe
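
To show where such a helper would plug in on the SMC side, here is a
rough sketch modeled on the existing smc_ib_map_mr_sg() in
net/smc/smc_ib.c; the page_size computation is the new part, and the
exact placement is illustrative rather than the final patch:

static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx)
{
	unsigned int page_size = PAGE_SIZE;
	unsigned int offset = 0;

	/* for a vmalloc'ed buffer backed by high-order pages, each
	 * physically contiguous chunk spans PAGE_SIZE << page_order
	 * bytes, so one MTT entry can cover a whole chunk instead of
	 * one entry per 4KB page
	 */
	if (buf_slot->is_vm)
		page_size <<= vmalloc_page_order(buf_slot->cpu_addr);

	return ib_map_mr_sg(buf_slot->mr[link_idx],
			    buf_slot->sgt[link_idx].sgl,
			    buf_slot->sgt[link_idx].orig_nents,
			    &offset, page_size);
}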
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-24 10:48 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

Hello, D. Wythe!

> Hi Uladzislau,
>
> Thank you for the feedback. I agree that we should avoid exposing
> internal implementation details like struct vm_struct to external
> subsystems.
>
> Following Christoph's suggestion, I'm planning to encapsulate the page
> order lookup into a minimal helper instead:
>
> unsigned int vmalloc_page_order(const void *addr)
> {
> 	struct vm_struct *vm;
>
> 	vm = find_vm_area(addr);
> 	return vm ? vm->page_order : 0;
> }
> EXPORT_SYMBOL_GPL(vmalloc_page_order);
>
> Does this approach look reasonable to you? It would keep the vm_struct
> layout private while satisfying the optimization needs of SMC.
>
Could you please clarify why you need info about page_order? I have not
looked at your second patch.

Thanks!

--
Uladzislau Rezki
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-24 14:57 UTC
To: Uladzislau Rezki
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
    Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> Hello, D. Wythe!
>
> [...]
> Could you please clarify why you need info about page_order? I have not
> looked at your second patch.
>
> Thanks!
>
> --
> Uladzislau Rezki

Hi Uladzislau,

This stems from optimizing memory registration in SMC-R. To provide the
RDMA hardware with direct access to memory buffers, we must register
them with the NIC. During this process, the hardware generates one MTT
entry for each physically contiguous block. Since these hardware entries
are a finite and scarce resource, and SMC currently defaults to a 4KB
registration granularity, a single 2MB buffer consumes 512 entries. In
high-concurrency scenarios, this inefficiency quickly exhausts NIC
resources and becomes a major bottleneck for system scalability.

To address this, we intend to use vmalloc_huge(). When it successfully
allocates high-order pages, the vmalloc area is backed by a sequence of
physically contiguous chunks (e.g., 2MB each). If we know this
page_order, we can register these larger physical blocks instead of
individual 4KB pages, reducing MTT consumption from 512 entries down to
1 for every 2MB of memory (with page_order == 9).

However, the result of vmalloc_huge() is currently opaque to the caller.
We cannot determine whether it successfully allocated huge pages or fell
back to 4KB pages based solely on the returned pointer. Therefore, we
need a helper function to query the actual page order, enabling SMC-R to
adapt its registration logic to the underlying physical layout.

I hope this clarifies our design motivation!

Best regards,
D. Wythe
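
For illustration, the allocation side of that plan could look like the
following sketch (it assumes the vmalloc_page_order() helper proposed
earlier in this thread; the page_order field on the buffer descriptor is
hypothetical):

	/* prefer huge mappings; vmalloc_huge() transparently falls
	 * back to order-0 (4KB) pages when high-order allocation fails
	 */
	buf_desc->cpu_addr = vmalloc_huge(bufsize, GFP_KERNEL | __GFP_ZERO);
	if (!buf_desc->cpu_addr)
		goto out;
	buf_desc->is_vm = true;
	/* discover what we actually got: 0 means plain 4KB pages */
	buf_desc->page_order = vmalloc_page_order(buf_desc->cpu_addr);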
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-26 10:28 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

Hello, D. Wythe!

> This stems from optimizing memory registration in SMC-R. To provide the
> RDMA hardware with direct access to memory buffers, we must register
> them with the NIC. During this process, the hardware generates one MTT
> entry for each physically contiguous block. Since these hardware entries
> are a finite and scarce resource, and SMC currently defaults to a 4KB
> registration granularity, a single 2MB buffer consumes 512 entries. In
> high-concurrency scenarios, this inefficiency quickly exhausts NIC
> resources and becomes a major bottleneck for system scalability.
>
> To address this, we intend to use vmalloc_huge(). When it successfully
> allocates high-order pages, the vmalloc area is backed by a sequence of
> physically contiguous chunks (e.g., 2MB each). If we know this
> page_order, we can register these larger physical blocks instead of
> individual 4KB pages, reducing MTT consumption from 512 entries down to
> 1 for every 2MB of memory (with page_order == 9).
>
> However, the result of vmalloc_huge() is currently opaque to the caller.
> We cannot determine whether it successfully allocated huge pages or fell
> back to 4KB pages based solely on the returned pointer. Therefore, we
> need a helper function to query the actual page order, enabling SMC-R to
> adapt its registration logic to the underlying physical layout.
>
> I hope this clarifies our design motivation!
>
Appreciate for the explanation. Yes it clarifies an intention.

As for proposed patch above:

- A page_order is available if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
- It makes sense to get a node, grab a spin-lock and find VM, save
  page_order and release the lock.

You can have a look at the vmalloc_dump_obj(void *object) function.
We try-spinlock there whereas you need just spin-lock. But the idea
is the same.

--
Uladzislau Rezki
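
Translating that guidance into code, the v2 helper would presumably look
something like this sketch, modeled on vmalloc_dump_obj(); addr_to_node(),
__find_vmap_area() and vm_area_page_order() are mm/vmalloc.c internals,
so treat the details as approximate:

unsigned int vmalloc_page_order(const void *addr)
{
	struct vmap_node *vn;
	struct vmap_area *va;
	unsigned int order = 0;

	vn = addr_to_node((unsigned long)addr);

	spin_lock(&vn->busy.lock);
	va = __find_vmap_area((unsigned long)addr, &vn->busy.root);
	/* vm_area_page_order() returns 0 when
	 * CONFIG_HAVE_ARCH_HUGE_VMALLOC is not defined
	 */
	if (va && va->vm)
		order = vm_area_page_order(va->vm);
	spin_unlock(&vn->busy.lock);

	return order;
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);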
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-26 12:02 UTC
To: Uladzislau Rezki
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
    Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Mon, Jan 26, 2026 at 11:28:46AM +0100, Uladzislau Rezki wrote:
> [...]
> As for proposed patch above:
>
> - A page_order is available if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
> - It makes sense to get a node, grab a spin-lock and find VM, save
>   page_order and release the lock.
>
> You can have a look at the vmalloc_dump_obj(void *object) function.
> We try-spinlock there whereas you need just spin-lock. But the idea
> is the same.
>
> --
> Uladzislau Rezki

Hi Uladzislau,

Thanks very much for the detailed guidance, especially on the correct
locking pattern. This is extremely helpful. I will follow it and send
a v2 patch series with the new helper implemented in mm/vmalloc.c.

Thanks again for your support.

Best regards,
D. Wythe
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-26 16:45 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Mon, Jan 26, 2026 at 08:02:26PM +0800, D. Wythe wrote:
> [...]
> Thanks very much for the detailed guidance, especially on the correct
> locking pattern. This is extremely helpful. I will follow it and send
> a v2 patch series with the new helper implemented in mm/vmalloc.c.
>
> Thanks again for your support.
>
Welcome!

--
Uladzislau Rezki
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Leon Romanovsky @ 2026-01-27 13:34 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote:
> [...]
> This stems from optimizing memory registration in SMC-R. To provide the
> RDMA hardware with direct access to memory buffers, we must register
> them with the NIC. During this process, the hardware generates one MTT
> entry for each physically contiguous block. Since these hardware entries
> are a finite and scarce resource, and SMC currently defaults to a 4KB
> registration granularity, a single 2MB buffer consumes 512 entries. In
> high-concurrency scenarios, this inefficiency quickly exhausts NIC
> resources and becomes a major bottleneck for system scalability.

I believe this complexity can be avoided by using the RDMA MR pool API,
as other ULPs do, for example NVMe.

Thanks

> To address this, we intend to use vmalloc_huge(). When it successfully
> allocates high-order pages, the vmalloc area is backed by a sequence of
> physically contiguous chunks (e.g., 2MB each). If we know this
> page_order, we can register these larger physical blocks instead of
> individual 4KB pages, reducing MTT consumption from 512 entries down to
> 1 for every 2MB of memory (with page_order == 9).
>
> [...]
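
For reference, the MR pool API Leon points to lives in
drivers/infiniband/core/mr_pool.c and is managed per QP; the typical ULP
pattern (a sketch with declarations, sizing and error handling elided)
looks like:

	#include <rdma/mr_pool.h>

	/* setup: pre-allocate nr fast-registration MRs on the QP */
	ret = ib_mr_pool_init(qp, &qp->rdma_mrs, nr,
			      IB_MR_TYPE_MEM_REG, max_num_sg);

	/* per operation: borrow an MR, ib_map_mr_sg() + IB_WR_REG_MR,
	 * then hand it back once the registration is no longer needed
	 */
	mr = ib_mr_pool_get(qp, &qp->rdma_mrs);
	...
	ib_mr_pool_put(qp, &qp->rdma_mrs, mr);

	/* teardown */
	ib_mr_pool_destroy(qp, &qp->rdma_mrs);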
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-28 3:45 UTC
To: Leon Romanovsky
Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote:
> [...]
> I believe this complexity can be avoided by using the RDMA MR pool API,
> as other ULPs do, for example NVMe.
>
> Thanks

Hi Leon,

Am I correct in assuming you are suggesting mr_pool to limit the number
of MRs as a way to cap MTTE consumption?

However, our goal is to maximize the total registered memory within
the MTTE limits rather than to cap it. In SMC-R, each connection
occupies a configurable, fixed-size registered buffer; consequently,
the more memory we can register, the more concurrent connections
we can support.

By leveraging vmalloc_huge() and the proposed helper to increase the
page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
physical block. This significantly reduces the total number of entries
required to map the same amount of memory, allowing us to serve more
connections under the same hardware constraints.

D. Wythe
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Leon Romanovsky @ 2026-01-28 11:13 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:
> [...]
> Am I correct in assuming you are suggesting mr_pool to limit the number
> of MRs as a way to cap MTTE consumption?

I don't see this as a limit, but as something that is considered
standard practice to reduce MTT consumption.

> However, our goal is to maximize the total registered memory within
> the MTTE limits rather than to cap it. In SMC-R, each connection
> occupies a configurable, fixed-size registered buffer; consequently,
> the more memory we can register, the more concurrent connections
> we can support.

It is not a cap, but a more efficient use of existing resources.

> By leveraging vmalloc_huge() and the proposed helper to increase the
> page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
> physical block. This significantly reduces the total number of entries
> required to map the same amount of memory, allowing us to serve more
> connections under the same hardware constraints.
>
> D. Wythe
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-28 12:44 UTC
To: Leon Romanovsky
Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote:
> [...]
> I don't see this as a limit, but as something that is considered
> standard practice to reduce MTT consumption.
>
> > However, our goal is to maximize the total registered memory within
> > the MTTE limits rather than to cap it. In SMC-R, each connection
> > occupies a configurable, fixed-size registered buffer; consequently,
> > the more memory we can register, the more concurrent connections
> > we can support.
>
> It is not a cap, but a more efficient use of existing resources.

Got it. While MR pools may be the more standard practice, they don't
address our specific bottleneck. In fact, SMC already has its own
internal MR reuse; our core issue remains reducing MTTE consumption by
increasing the registration granularity to maximize the memory size
mapped per MTT entry.
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Leon Romanovsky @ 2026-01-28 13:49 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote:
> [...]
> Got it. While MR pools may be the more standard practice, they don't
> address our specific bottleneck. In fact, SMC already has its own
> internal MR reuse; our core issue remains reducing MTTE consumption by
> increasing the registration granularity to maximize the memory size
> mapped per MTT entry.

And this is something MR pools can handle as well. We are going in circles,
so let's summarize.

I see SMC-R as one of the RDMA ULPs, and it should ideally rely on the
existing ULP API used by NVMe, NFS, and others, rather than maintaining its
own internal logic.

I also do not know whether vmalloc_page_order() is an appropriate solution;
I only want to show that we can probably achieve the same result without
introducing a new function.

Thanks
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-28 13:49 ` Leon Romanovsky @ 2026-01-29 11:03 ` D. Wythe 2026-01-29 12:22 ` Leon Romanovsky 0 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-29 11:03 UTC (permalink / raw) To: Leon Romanovsky Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote: > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote: > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote: > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote: > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote: > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote: > > > > > > > Hello, D. Wythe! > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote: > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote: > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a > > > > > > > > > > virtual address. Export this symbol to modules so that modularized > > > > > > > > > > subsystems can perform lookups on vmalloc addresses. > > > > > > > > > > > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > > > > > > > > > > --- > > > > > > > > > > mm/vmalloc.c | 1 + > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644 > > > > > > > > > > --- a/mm/vmalloc.c > > > > > > > > > > +++ b/mm/vmalloc.c > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr) > > > > > > > > > > > > > > > > > > > > return va->vm; > > > > > > > > > > } > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area); > > > > > > > > > > > > > > > > > > > This is internal. We can not just export it. > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing > > > > > > > > internal implementation details like struct vm_struct to external > > > > > > > > subsystems. > > > > > > > > > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page > > > > > > > > order lookup into a minimal helper instead: > > > > > > > > > > > > > > > > unsigned int vmalloc_page_order(const void *addr){ > > > > > > > > struct vm_struct *vm; > > > > > > > > vm = find_vm_area(addr); > > > > > > > > return vm ? vm->page_order : 0; > > > > > > > > } > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order); > > > > > > > > > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct > > > > > > > > layout private while satisfying the optimization needs of SMC. > > > > > > > > > > > > > > > Could you please clarify why you need info about page_order? I have not > > > > > > > looked at your second patch. > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > -- > > > > > > > Uladzislau Rezki > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > This stems from optimizing memory registration in SMC-R. 
To provide the > > > > > > RDMA hardware with direct access to memory buffers, we must register > > > > > > them with the NIC. During this process, the hardware generates one MTT > > > > > > entry for each physically contiguous block. Since these hardware entries > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC > > > > > > resources and becomes a major bottleneck for system scalability. > > > > > > > > > > I believe this complexity can be avoided by using the RDMA MR pool API, > > > > > as other ULPs do, for example NVMe. > > > > > > > > > > Thanks > > > > > > > > > > > > > Hi Leon, > > > > > > > > Am I correct in assuming you are suggesting mr_pool to limit the number > > > > of MRs as a way to cap MTTE consumption? > > > > > > I don't see this a limit, but something that is considered standard > > > practice to reduce MTT consumption. > > > > > > > > > > > However, our goal is to maximize the total registered memory within > > > > the MTTE limits rather than to cap it. In SMC-R, each connection > > > > occupies a configurable, fixed-size registered buffer; consequently, > > > > the more memory we can register, the more concurrent connections > > > > we can support. > > > > > > It is not cap, but more efficient use of existing resources. > > > > Got it. While MRs pool might be more standard practice, but it doesn't > > address our specific bottleneck. In fact, smc already has its own internal > > MR reuse; our core issue remains reducing MTTE consumption by increasing the > > registration granularity to maximize the memory size mapped per MTT entry. > > And this is something MR pools can handle as well. We are going in circles, > so let's summarize. I believe some points need to be thoroughly clarified here: > > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the > existing ULP API used by NVMe, NFS, and others, rather than maintaining its > own internal logic. SMC is not opposed to adopting newer RDMA interfaces; in fact, I have already planned a gradual migration to the updated RDMA APIs. We are currently in the process of adapting to ib_cqe, for instance. As long as functionality remains intact, there is no reason to oppose changes that reduce maintenance overhead or provide additional gains, but such a transition takes time. > > I also do not know whether vmalloc_page_order() is an appropriate solution; > I only want to show that we can probably achieve the same result without > introducing a new function. Regarding the specific issue under discussion, I believe the newer RDMA APIs you mentioned do not solve my problem, at least for now. My understanding is that regardless of how MRs are pooled, the core requirement is to increase the page_size parameter in ib_map_mr_sg to maximize the physical size mapped per MTTE. From the code I have examined, I see no evidence of these new APIs utilizing values other than 4KB. Of course, I believe that regardless of whether this issue currently exists, it is something the RDMA community can resolve. However, as I mentioned, adapting to new API takes time. Before a complete transition is achieved, we need to allow for some necessary updates to SMC. Thanks ^ permalink raw reply [flat|nested] 30+ messages in thread
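To make the page_size point concrete: it is the final argument of ib_map_mr_sg(), and it sets how much physical memory each translation entry covers. A minimal sketch, assuming a buffer already described by an sg_table and an MR sized for a single segment; the demo_* name is hypothetical:

#include <linux/sizes.h>
#include <rdma/ib_verbs.h>

/* With PAGE_SIZE as the last argument, a physically contiguous 2MB
 * buffer costs 512 MTT entries; with SZ_2M it costs one, provided
 * every scatterlist segment starts and ends on a 2MB boundary.
 */
static int demo_map_mr_2m(struct ib_mr *mr, struct sg_table *sgt)
{
	unsigned int offset = 0;

	return ib_map_mr_sg(mr, sgt->sgl, sgt->orig_nents, &offset, SZ_2M);
}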
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 11:03 ` D. Wythe @ 2026-01-29 12:22 ` Leon Romanovsky 2026-01-29 14:04 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Leon Romanovsky @ 2026-01-29 12:22 UTC (permalink / raw) To: D. Wythe Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 07:03:23PM +0800, D. Wythe wrote: > On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote: > > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote: > > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote: > > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote: > > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote: > > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote: > > > > > > > > Hello, D. Wythe! > > > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote: > > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote: > > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a > > > > > > > > > > > virtual address. Export this symbol to modules so that modularized > > > > > > > > > > > subsystems can perform lookups on vmalloc addresses. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > > > > > > > > > > > --- > > > > > > > > > > > mm/vmalloc.c | 1 + > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644 > > > > > > > > > > > --- a/mm/vmalloc.c > > > > > > > > > > > +++ b/mm/vmalloc.c > > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr) > > > > > > > > > > > > > > > > > > > > > > return va->vm; > > > > > > > > > > > } > > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area); > > > > > > > > > > > > > > > > > > > > > This is internal. We can not just export it. > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing > > > > > > > > > internal implementation details like struct vm_struct to external > > > > > > > > > subsystems. > > > > > > > > > > > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page > > > > > > > > > order lookup into a minimal helper instead: > > > > > > > > > > > > > > > > > > unsigned int vmalloc_page_order(const void *addr){ > > > > > > > > > struct vm_struct *vm; > > > > > > > > > vm = find_vm_area(addr); > > > > > > > > > return vm ? vm->page_order : 0; > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order); > > > > > > > > > > > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct > > > > > > > > > layout private while satisfying the optimization needs of SMC. > > > > > > > > > > > > > > > > > Could you please clarify why you need info about page_order? I have not > > > > > > > > looked at your second patch. > > > > > > > > > > > > > > > > Thanks! 
> > > > > > > > > > > > > > > > -- > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > This stems from optimizing memory registration in SMC-R. To provide the > > > > > > > RDMA hardware with direct access to memory buffers, we must register > > > > > > > them with the NIC. During this process, the hardware generates one MTT > > > > > > > entry for each physically contiguous block. Since these hardware entries > > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB > > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In > > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC > > > > > > > resources and becomes a major bottleneck for system scalability. > > > > > > > > > > > > I believe this complexity can be avoided by using the RDMA MR pool API, > > > > > > as other ULPs do, for example NVMe. > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > Hi Leon, > > > > > > > > > > Am I correct in assuming you are suggesting mr_pool to limit the number > > > > > of MRs as a way to cap MTTE consumption? > > > > > > > > I don't see this a limit, but something that is considered standard > > > > practice to reduce MTT consumption. > > > > > > > > > > > > > > However, our goal is to maximize the total registered memory within > > > > > the MTTE limits rather than to cap it. In SMC-R, each connection > > > > > occupies a configurable, fixed-size registered buffer; consequently, > > > > > the more memory we can register, the more concurrent connections > > > > > we can support. > > > > > > > > It is not cap, but more efficient use of existing resources. > > > > > > Got it. While MRs pool might be more standard practice, but it doesn't > > > address our specific bottleneck. In fact, smc already has its own internal > > > MR reuse; our core issue remains reducing MTTE consumption by increasing the > > > registration granularity to maximize the memory size mapped per MTT entry. > > > > And this is something MR pools can handle as well. We are going in circles, > > so let's summarize. > > I believe some points need to be thoroughly clarified here: > > > > > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the > > existing ULP API used by NVMe, NFS, and others, rather than maintaining its > > own internal logic. > > SMC is not opposed to adopting newer RDMA interfaces; in fact, I have > already planned a gradual migration to the updated RDMA APIs. We are > currently in the process of adapting to ib_cqe, for instance. As long as > functionality remains intact, there is no reason to oppose changes that > reduce maintenance overhead or provide additional gains, but such a > transition takes time. > > > > > I also do not know whether vmalloc_page_order() is an appropriate solution; > > I only want to show that we can probably achieve the same result without > > introducing a new function. > > Regarding the specific issue under discussion, I believe the newer RDMA > APIs you mentioned do not solve my problem, at least for now. My > understanding is that regardless of how MRs are pooled, the core > requirement is to increase the page_size parameter in ib_map_mr_sg to > maximize the physical size mapped per MTTE. From the code I have > examined, I see no evidence of these new APIs utilizing values other > than 4KB. > > Of course, I believe that regardless of whether this issue > currently exists, it is something the RDMA community can resolve. 
> However, as I mentioned, adapting to new API takes time. Before a > complete transition is achieved, we need to allow for some necessary > updates to SMC. I disagree with that statement. SMC‑R has a long history of re‑implementing existing RDMA ULP APIs, and not always correctly. https://lore.kernel.org/netdev/20170510072627.12060-1-hch@lst.de/ https://lore.kernel.org/netdev/20241105112313.GE311159@unreal/#t Thanks > > Thanks > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 12:22 ` Leon Romanovsky @ 2026-01-29 14:04 ` D. Wythe 0 siblings, 0 replies; 30+ messages in thread From: D. Wythe @ 2026-01-29 14:04 UTC (permalink / raw) To: Leon Romanovsky Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 02:22:02PM +0200, Leon Romanovsky wrote: > On Thu, Jan 29, 2026 at 07:03:23PM +0800, D. Wythe wrote: > > On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote: > > > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote: > > > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote: > > > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote: > > > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote: > > > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote: > > > > > > > > > Hello, D. Wythe! > > > > > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote: > > > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote: > > > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a > > > > > > > > > > > > virtual address. Export this symbol to modules so that modularized > > > > > > > > > > > > subsystems can perform lookups on vmalloc addresses. > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > > > > > > > > > > > > --- > > > > > > > > > > > > mm/vmalloc.c | 1 + > > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644 > > > > > > > > > > > > --- a/mm/vmalloc.c > > > > > > > > > > > > +++ b/mm/vmalloc.c > > > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr) > > > > > > > > > > > > > > > > > > > > > > > > return va->vm; > > > > > > > > > > > > } > > > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area); > > > > > > > > > > > > > > > > > > > > > > > This is internal. We can not just export it. > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing > > > > > > > > > > internal implementation details like struct vm_struct to external > > > > > > > > > > subsystems. > > > > > > > > > > > > > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page > > > > > > > > > > order lookup into a minimal helper instead: > > > > > > > > > > > > > > > > > > > > unsigned int vmalloc_page_order(const void *addr){ > > > > > > > > > > struct vm_struct *vm; > > > > > > > > > > vm = find_vm_area(addr); > > > > > > > > > > return vm ? vm->page_order : 0; > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order); > > > > > > > > > > > > > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct > > > > > > > > > > layout private while satisfying the optimization needs of SMC. 
> > > > > > > > > > > > > > > > > > > Could you please clarify why you need info about page_order? I have not > > > > > > > > > looked at your second patch. > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > This stems from optimizing memory registration in SMC-R. To provide the > > > > > > > > RDMA hardware with direct access to memory buffers, we must register > > > > > > > > them with the NIC. During this process, the hardware generates one MTT > > > > > > > > entry for each physically contiguous block. Since these hardware entries > > > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB > > > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In > > > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC > > > > > > > > resources and becomes a major bottleneck for system scalability. > > > > > > > > > > > > > > I believe this complexity can be avoided by using the RDMA MR pool API, > > > > > > > as other ULPs do, for example NVMe. > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > Hi Leon, > > > > > > > > > > > > Am I correct in assuming you are suggesting mr_pool to limit the number > > > > > > of MRs as a way to cap MTTE consumption? > > > > > > > > > > I don't see this a limit, but something that is considered standard > > > > > practice to reduce MTT consumption. > > > > > > > > > > > > > > > > > However, our goal is to maximize the total registered memory within > > > > > > the MTTE limits rather than to cap it. In SMC-R, each connection > > > > > > occupies a configurable, fixed-size registered buffer; consequently, > > > > > > the more memory we can register, the more concurrent connections > > > > > > we can support. > > > > > > > > > > It is not cap, but more efficient use of existing resources. > > > > > > > > Got it. While MRs pool might be more standard practice, but it doesn't > > > > address our specific bottleneck. In fact, smc already has its own internal > > > > MR reuse; our core issue remains reducing MTTE consumption by increasing the > > > > registration granularity to maximize the memory size mapped per MTT entry. > > > > > > And this is something MR pools can handle as well. We are going in circles, > > > so let's summarize. > > > > I believe some points need to be thoroughly clarified here: > > > > > > > > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the > > > existing ULP API used by NVMe, NFS, and others, rather than maintaining its > > > own internal logic. > > > > SMC is not opposed to adopting newer RDMA interfaces; in fact, I have > > already planned a gradual migration to the updated RDMA APIs. We are > > currently in the process of adapting to ib_cqe, for instance. As long as > > functionality remains intact, there is no reason to oppose changes that > > reduce maintenance overhead or provide additional gains, but such a > > transition takes time. > > > > > > > > I also do not know whether vmalloc_page_order() is an appropriate solution; > > > I only want to show that we can probably achieve the same result without > > > introducing a new function. > > > > Regarding the specific issue under discussion, I believe the newer RDMA > > APIs you mentioned do not solve my problem, at least for now. 
My > > understanding is that regardless of how MRs are pooled, the core > > requirement is to increase the page_size parameter in ib_map_mr_sg to > > maximize the physical size mapped per MTTE. From the code I have > > examined, I see no evidence of these new APIs utilizing values other > > than 4KB. > > > > Of course, I believe that regardless of whether this issue > > currently exists, it is something the RDMA community can resolve. > > However, as I mentioned, adapting to new API takes time. Before a > > complete transition is achieved, we need to allow for some necessary > > updates to SMC. > > I disagree with that statement. > > SMC‑R has a long history of re‑implementing existing RDMA ULP APIs, and > not always correctly. > > https://lore.kernel.org/netdev/20170510072627.12060-1-hch@lst.de/ > https://lore.kernel.org/netdev/20241105112313.GE311159@unreal/#t > I see that this discussion has moved beyond the technical scope of the patch into historical design critiques. I do not wish to engage in a debate over SMC's history, nor am I responsible for those past decisions. I will discontinue the conversation here. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-28 3:45 ` D. Wythe 2026-01-28 11:13 ` Leon Romanovsky @ 2026-01-28 18:06 ` Jason Gunthorpe 2026-01-29 11:36 ` D. Wythe 1 sibling, 1 reply; 30+ messages in thread From: Jason Gunthorpe @ 2026-01-28 18:06 UTC (permalink / raw) To: D. Wythe Cc: Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > By leveraging vmalloc_huge() and the proposed helper to increase the > page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous > physical block. This doesn't seem right. If your goal is to take a vmalloc() pointer and convert it to an MR via a scatterlist and ib_map_mr_sg(), then you should be asking for a helper to convert a kernel pointer into a scatterlist. Even if you do this in a naive way and call the sg_alloc_append_table_from_pages() function, it will automatically join physically contiguous ranges together for you. From there you can check the resulting scatterlist and compute the page_size to pass to ib_map_mr_sg(). No need to ask the MM for anything other than the list of physicals to build the scatterlist with. Still, I wouldn't mind seeing a helper to convert a kernel pointer into a scatterlist, because I have seen that open-coded in a few places, and maybe there are ways to optimize that using more information from the MM - but those should be APIs used only by this helper, not exposed to drivers. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
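A rough sketch of the naive route Jason describes: collect the pages behind a vmalloc'ed buffer and let sg_alloc_append_table_from_pages() merge physically adjacent ranges. vmalloc_to_page() and the append-table API are existing kernel functions; the helper name and the choice to leave DMA mapping to the caller are assumptions of this sketch:

#include <linux/limits.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Build a scatterlist for a vmalloc'ed buffer of nr_pages base pages.
 * 'append' must be zero-initialized; on success the caller uses
 * append->sgt and later releases it with sg_free_append_table().
 */
static int demo_vmalloc_to_sgt(void *buf, unsigned int nr_pages,
			       struct sg_append_table *append)
{
	struct page **pages;
	unsigned int i;
	int ret;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	for (i = 0; i < nr_pages; i++)
		pages[i] = vmalloc_to_page(buf + i * PAGE_SIZE);

	/* physically adjacent pages are coalesced into single segments */
	ret = sg_alloc_append_table_from_pages(append, pages, nr_pages, 0,
					       (unsigned long)nr_pages * PAGE_SIZE,
					       UINT_MAX, 0, GFP_KERNEL);
	kvfree(pages);
	return ret;
}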
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-28 18:06 ` Jason Gunthorpe @ 2026-01-29 11:36 ` D. Wythe 2026-01-29 13:20 ` Jason Gunthorpe 0 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-29 11:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: D. Wythe, Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Wed, Jan 28, 2026 at 02:06:29PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > By leveraging vmalloc_huge() and the proposed helper to increase the > > page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous > > physical block. > > This doesn't seem right. If your goal is to take a vmalloc() pointer > and convert it to an MR via a scatterlist and ib_map_mr_sg(), then you > should be asking for a helper to convert a kernel pointer into a > scatterlist. > > Even if you do this in a naive way and call the > sg_alloc_append_table_from_pages() function, it will automatically join > physically contiguous ranges together for you. > > From there you can check the resulting scatterlist and compute the > page_size to pass to ib_map_mr_sg(). > > No need to ask the MM for anything other than the list of physicals to > build the scatterlist with. > > Still, I wouldn't mind seeing a helper to convert a kernel pointer > into a scatterlist, because I have seen that open-coded in a few places, > and maybe there are ways to optimize that using more information from > the MM - but those should be APIs used only by this helper, not exposed to > drivers. > > Jason Hi Jason, To be honest, I was previously unaware of the sg_alloc_append_table_from_pages() function, although I had indeed considered manually calculating the size of contiguous physical blocks. The reason I proposed the MM helper is that SMC is not a driver; it utilizes vmalloc() for memory allocation and is thus in direct contact with the MM. From this perspective, having the MM provide the page_order would be the most straightforward approach. Given the significant opposition and our plans to transition SMC to newer APIs in the future anyway, I agree that introducing this helper now is less justified. I will follow your suggestion and update the next version accordingly. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 11:36 ` D. Wythe @ 2026-01-29 13:20 ` Jason Gunthorpe 2026-01-30 8:51 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Jason Gunthorpe @ 2026-01-29 13:20 UTC (permalink / raw) To: D. Wythe Cc: Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > From there you can check the resulting scatterlist and compute the > > > page_size to pass to ib_map_mr_sg(). I should clarify that this is done after DMA mapping the scatterlist; DMA mapping can improve the page size. And maybe the core code should be helping compute the MR's target page size for a scatterlist. We already have code to do this in umem, and it is pretty tricky considering the IOVA-related rules. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
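For illustration, a much-simplified version of that computation over a DMA-mapped scatterlist. It enforces only the interior alignment constraint and deliberately ignores the IOVA-offset rules and the device's supported page-size bitmap that ib_umem_find_best_pgsz() must honour, which is exactly the tricky part Jason refers to:

#include <linux/bitops.h>
#include <linux/log2.h>
#include <linux/scatterlist.h>

/* Largest power-of-two page size usable to register a DMA-mapped
 * scatterlist: every segment boundary interior to the MR must fall on
 * a page-size multiple. Simplified sketch; real code must also apply
 * the IOVA offset rules and mask against the device's capabilities.
 */
static unsigned long demo_best_page_size(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	unsigned long mask = 0;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (i)			/* all but the first must start aligned */
			mask |= sg_dma_address(sg);
		if (i != nents - 1)	/* all but the last must end aligned */
			mask |= sg_dma_address(sg) + sg_dma_len(sg);
	}
	if (!mask)	/* single segment: one entry can cover it all */
		return roundup_pow_of_two(sg_dma_len(sgl));
	return 1UL << __ffs(mask);	/* lowest set bit bounds the alignment */
}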
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 13:20 ` Jason Gunthorpe @ 2026-01-30 8:51 ` D. Wythe 2026-01-30 15:16 ` Jason Gunthorpe 0 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-30 8:51 UTC (permalink / raw) To: Jason Gunthorpe Cc: D. Wythe, Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > > From there you can check the resulting scatterlist and compute the > > > page_size to pass to ib_map_mr_sg(). > > I should clarify this is done after DMA mapping the scatterlist. dma > mapping can improve the page size. > > And maybe the core code should be helping compute the MR's target page > size for a scatterlist.. We already have code to do this in umem, and > it is a pretty bit tricky considering the IOVA related rules. > Hi Jason, After a deep dive into ib_umem_find_best_pgsz(), I have to say it is much more subtle than it first appears. The IOVA-to-PA relative offset rules, in particular, make it quite easy to get wrong. While SMC could duplicate this logic, it is certainly not ideal for maintenance. Are there any plans to refactor this into a generic RDMA core helper—for instance, one that can determine the best page size directly from an sg_table or scatterlist? Best regards, D. Wythe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-30 8:51 ` D. Wythe @ 2026-01-30 15:16 ` Jason Gunthorpe 2026-02-03 9:14 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Jason Gunthorpe @ 2026-01-30 15:16 UTC (permalink / raw) To: D. Wythe Cc: Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 30, 2026 at 04:51:31PM +0800, D. Wythe wrote: > On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote: > > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > > > > From there you can check the resulting scatterlist and compute the > > > > page_size to pass to ib_map_mr_sg(). > > > > I should clarify this is done after DMA mapping the scatterlist. dma > > mapping can improve the page size. > > > > And maybe the core code should be helping compute the MR's target page > > size for a scatterlist.. We already have code to do this in umem, and > > it is a pretty bit tricky considering the IOVA related rules. > > > > Hi Jason, > > After a deep dive into ib_umem_find_best_pgsz(), I have to say it is > much more subtle than it first appears. The IOVA-to-PA relative offset > rules, in particular, make it quite easy to get wrong. > > While SMC could duplicate this logic, it is certainly not ideal for > maintenance. Are there any plans to refactor this into a generic RDMA > core helper—for instance, one that can determine the best page size > directly from an sg_table or scatterlist? I have not heard of anyone touching this. It looks like there are only two users in the kernel that pass something other than PAGE_SIZE, so it seems nobody has cared about this till now. With high order folios being more common it seems like something missing. However, I wonder what the drivers do with the input page size, segmenting a scatterlist is a bit hard and we have helpers for that already too. It is a bigger project but probably the right thing is to remove the page size input, wrap the scatterlist in a umem and fixup the drivers to use the existing umem support for building mtts, splitting scatterlists into blocks and so on. The kernel side here has been left alone for a long time.. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-30 15:16 ` Jason Gunthorpe @ 2026-02-03 9:14 ` D. Wythe 0 siblings, 0 replies; 30+ messages in thread From: D. Wythe @ 2026-02-03 9:14 UTC (permalink / raw) To: Jason Gunthorpe Cc: D. Wythe, Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 30, 2026 at 11:16:36AM -0400, Jason Gunthorpe wrote: > On Fri, Jan 30, 2026 at 04:51:31PM +0800, D. Wythe wrote: > > On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote: > > > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > > > > > > From there you can check the resulting scatterlist and compute the > > > > > page_size to pass to ib_map_mr_sg(). > > > > > > I should clarify this is done after DMA mapping the scatterlist. dma > > > mapping can improve the page size. > > > > > > And maybe the core code should be helping compute the MR's target page > > > size for a scatterlist.. We already have code to do this in umem, and > > > it is a pretty bit tricky considering the IOVA related rules. > > > > > > > Hi Jason, > > > > After a deep dive into ib_umem_find_best_pgsz(), I have to say it is > > much more subtle than it first appears. The IOVA-to-PA relative offset > > rules, in particular, make it quite easy to get wrong. > > > > While SMC could duplicate this logic, it is certainly not ideal for > > maintenance. Are there any plans to refactor this into a generic RDMA > > core helper—for instance, one that can determine the best page size > > directly from an sg_table or scatterlist? > > I have not heard of anyone touching this. > > It looks like there are only two users in the kernel that pass > something other than PAGE_SIZE, so it seems nobody has cared about > this till now. > > With high order folios being more common it seems like something > missing. > > However, I wonder what the drivers do with the input page size, > segmenting a scatterlist is a bit hard and we have helpers for that > already too. > > It is a bigger project but probably the right thing is to remove the > page size input, wrap the scatterlist in a umem and fixup the drivers > to use the existing umem support for building mtts, splitting > scatterlists into blocks and so on. > > The kernel side here has been left alone for a long time.. I am also curious about the original design intent behind requiring the caller to explicitly pass `page_size`. From what I can see, its primary role is to define the memory size per MTTE, but calculating the optimal value is surprisingly complex. I completely agree that providing an automatic way to optimize or calculate the best page size should be the responsibility of the drivers or the RDMA core themselves. Handling such low-level hardware-related details in a ULP like SMC feels misplaced. Since it appears this isn't a high-priority issue for the community at the moment, and a proper fix requires a much larger architectural effort in the RDMA core, I will withdraw this patch series. I'll keep an eye on the RDMA subsystem's progress and see if a more generic solution emerges in the future. Thanks, D. Wythe ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers 2026-01-23 8:23 [PATCH net-next 0/3] net/smc: buffer allocation and registration improvements D. Wythe 2026-01-23 8:23 ` [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers D. Wythe 2026-01-23 8:23 ` [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() D. Wythe @ 2026-01-23 8:23 ` D. Wythe 2026-01-23 14:52 ` Christoph Hellwig 2 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-23 8:23 UTC (permalink / raw) To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki, Wenjia Zhang Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang SMC-R buffers currently use 4KB page mapping for IB registration. Each page consumes one MTTE, which is inefficient and quickly depletes limited IB hardware resources for large buffers. For virtually contiguous buffers, switch to vmalloc_huge() to leverage huge page support. By using larger page sizes during IB MR registration, we can drastically reduce MTTE consumption. For physically contiguous buffers, the entire buffer now requires only a single MTTE. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> --- net/smc/smc_core.c | 3 ++- net/smc/smc_ib.c | 23 ++++++++++++++++++++--- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c index 6219db498976..8aca5dc54be7 100644 --- a/net/smc/smc_core.c +++ b/net/smc/smc_core.c @@ -2348,7 +2348,8 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr, goto out; fallthrough; // try virtually contiguous buf case SMCR_VIRT_CONT_BUFS: - buf_desc->cpu_addr = vzalloc(PAGE_SIZE << buf_desc->order); + buf_desc->cpu_addr = vmalloc_huge(PAGE_SIZE << buf_desc->order, + GFP_KERNEL | __GFP_ZERO); if (!buf_desc->cpu_addr) goto out; buf_desc->pages = NULL; diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index 1154907c5c05..67211d44a1db 100644 --- a/net/smc/smc_ib.c +++ b/net/smc/smc_ib.c @@ -20,6 +20,7 @@ #include <linux/wait.h> #include <linux/mutex.h> #include <linux/inetdevice.h> +#include <linux/vmalloc.h> #include <rdma/ib_verbs.h> #include <rdma/ib_cache.h> @@ -697,6 +698,18 @@ void smc_ib_put_memory_region(struct ib_mr *mr) ib_dereg_mr(mr); } +static inline int smc_buf_get_vm_page_order(struct smc_buf_desc *buf_slot) +{ +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC + struct vm_struct *vm; + + vm = find_vm_area(buf_slot->cpu_addr); + return vm ? vm->page_order : 0; +#else + return 0; +#endif +} + static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx) { unsigned int offset = 0; @@ -706,8 +719,9 @@ static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx) sg_num = ib_map_mr_sg(buf_slot->mr[link_idx], buf_slot->sgt[link_idx].sgl, buf_slot->sgt[link_idx].orig_nents, - &offset, PAGE_SIZE); - + &offset, + buf_slot->is_vm ? PAGE_SIZE << smc_buf_get_vm_page_order(buf_slot) : + PAGE_SIZE << buf_slot->order); return sg_num; } @@ -719,7 +733,10 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags, return 0; /* already done */ buf_slot->mr[link_idx] = - ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order); + ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, + buf_slot->is_vm ? 
+ 1 << (buf_slot->order - smc_buf_get_vm_page_order(buf_slot)) : 1); + if (IS_ERR(buf_slot->mr[link_idx])) { int rc; -- 2.45.0 ^ permalink raw reply [flat|nested] 30+ messages in thread
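For a concrete sense of the savings this patch targets, take a 2MB buffer, i.e. buf_desc->order = 9 with 4KB pages; the figures follow from the code above and the 512-entry example quoted earlier in the thread:

before the patch: page_size = PAGE_SIZE (4KB), so 2MB / 4KB = 512 MTT entries per buffer
physically contiguous buffer after the patch: page_size = PAGE_SIZE << 9 = 2MB and max_num_sg = 1, so 1 entry
vmalloc_huge() buffer backed by 2MB pages (page_order = 9): max_num_sg = 1 << (9 - 9) = 1, so 1 entry
vmalloc_huge() falling back to 4KB pages (page_order = 0): unchanged, 512 entries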
* Re: [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers 2026-01-23 8:23 ` [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers D. Wythe @ 2026-01-23 14:52 ` Christoph Hellwig 2026-01-24 9:25 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2026-01-23 14:52 UTC (permalink / raw) To: D. Wythe Cc: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 23, 2026 at 04:23:49PM +0800, D. Wythe wrote: > +static inline int smc_buf_get_vm_page_order(struct smc_buf_desc *buf_slot) > +{ > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC > + struct vm_struct *vm; > + > + vm = find_vm_area(buf_slot->cpu_addr); > + return vm ? vm->page_order : 0; > +#else > + return 0; > +#endif You might want to encapsulate this logic in a vmalloc_order or similar helper in vmalloc.c. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers 2026-01-23 14:52 ` Christoph Hellwig @ 2026-01-24 9:25 ` D. Wythe 0 siblings, 0 replies; 30+ messages in thread From: D. Wythe @ 2026-01-24 9:25 UTC (permalink / raw) To: Christoph Hellwig Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 23, 2026 at 06:52:55AM -0800, Christoph Hellwig wrote: > On Fri, Jan 23, 2026 at 04:23:49PM +0800, D. Wythe wrote: > > +static inline int smc_buf_get_vm_page_order(struct smc_buf_desc *buf_slot) > > +{ > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC > > + struct vm_struct *vm; > > + > > + vm = find_vm_area(buf_slot->cpu_addr); > > + return vm ? vm->page_order : 0; > > +#else > > + return 0; > > +#endif > > You might want to encapsulate this logic in a vmalloc_order or similar > helper in vmalloc.c. Hi Christoph, That's a great suggestion. Encapsulating this logic into a helper like vmalloc_page_order() (or similar) within vmalloc.c is indeed much cleaner than exporting find_vm_area(). I'll implement this helper in V2 and use it in the SMC code. Thanks for pointing this out! Thanks, D. Wythe ^ permalink raw reply [flat|nested] 30+ messages in thread
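One plausible shape for that V2 helper in mm/vmalloc.c, folding in the CONFIG_HAVE_ARCH_HUGE_VMALLOC guard from patch 3/3; this is a sketch of the stated plan, not a posted patch:

/* Return the page order of the vmalloc mapping backing @addr, or 0 for
 * base pages, non-vmalloc addresses, and kernels without huge vmalloc.
 */
unsigned int vmalloc_page_order(const void *addr)
{
#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
	struct vm_struct *vm = find_vm_area(addr);

	return vm ? vm->page_order : 0;
#else
	return 0;
#endif
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);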