* [PATCH net-next 0/3] net/smc: buffer allocation and registration improvements
From: D. Wythe @ 2026-01-23 8:23 UTC
To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

This series improves SMC-R buffer management by refining the allocation
logic and reducing hardware resource overhead during registration.

The primary improvement is a significant reduction in MTTE (Memory
Translation Table Entry) consumption.
By aligning IB registration with the actual physical block sizes, we can
reduce the entry count from one per 4KB page to just one per contiguous
block. This is especially beneficial for large buffers, preventing
hardware resource exhaustion on RDMA NICs.
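
To put concrete numbers on this (assuming the common 4KB PAGE_SIZE): a
2MB buffer registered at page granularity costs 2MB / 4KB = 512 MTT
entries, while the same buffer registered as one physically contiguous
block costs a single entry, a 512x reduction per buffer.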

D. Wythe (3):
  net/smc: cap allocation order for SMC-R physically contiguous buffers
  mm: vmalloc: export find_vm_area()
  net/smc: optimize MTTE consumption for SMC-R buffers

 mm/vmalloc.c       |  1 +
 net/smc/smc_core.c | 31 ++++++++++++++++++-------------
 net/smc/smc_ib.c   | 23 ++++++++++++++++++++---
 3 files changed, 39 insertions(+), 16 deletions(-)

--
2.45.0
* [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers

From: D. Wythe @ 2026-01-23 8:23 UTC
To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

alloc_pages() cannot satisfy requests exceeding MAX_PAGE_ORDER, and
attempting such allocations leads to guaranteed failures and potential
kernel warnings.

For SMCR_PHYS_CONT_BUFS, the allocation order is now capped to
MAX_PAGE_ORDER, ensures the attempts to allocate the largest possible
physically contiguous chunk instead of failing with an invalid order,
which also avoid redundant "try-fail-degrade" cycles in __smc_buf_create().

For SMCR_MIXED_BUFS, If it's order exceeds MAX_PAGE_ORDER, skips the
doomed physical allocation attempt and fallback to virtual memory
immediately.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
---
 net/smc/smc_core.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index e4eabc83719e..6219db498976 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2324,26 +2324,30 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
 	if (!buf_desc)
 		return ERR_PTR(-ENOMEM);
 
+	buf_desc->order = get_order(bufsize);
+
 	switch (lgr->buf_type) {
 	case SMCR_PHYS_CONT_BUFS:
+		buf_desc->order = min(buf_desc->order, MAX_PAGE_ORDER);
+		fallthrough;
 	case SMCR_MIXED_BUFS:
-		buf_desc->order = get_order(bufsize);
-		buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
-					      __GFP_NOMEMALLOC | __GFP_COMP |
-					      __GFP_NORETRY | __GFP_ZERO,
-					      buf_desc->order);
-		if (buf_desc->pages) {
-			buf_desc->cpu_addr =
-				(void *)page_address(buf_desc->pages);
-			buf_desc->len = bufsize;
-			buf_desc->is_vm = false;
-			break;
+		if (buf_desc->order <= MAX_PAGE_ORDER) {
+			buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+						      __GFP_NOMEMALLOC | __GFP_COMP |
+						      __GFP_NORETRY | __GFP_ZERO,
+						      buf_desc->order);
+			if (buf_desc->pages) {
+				buf_desc->cpu_addr =
+					(void *)page_address(buf_desc->pages);
+				buf_desc->len = bufsize;
+				buf_desc->is_vm = false;
+				break;
+			}
 		}
 		if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
 			goto out;
 		fallthrough;	// try virtually contiguous buf
 	case SMCR_VIRT_CONT_BUFS:
-		buf_desc->order = get_order(bufsize);
 		buf_desc->cpu_addr = vzalloc(PAGE_SIZE << buf_desc->order);
 		if (!buf_desc->cpu_addr)
 			goto out;
--
2.45.0
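
A worked example of the new logic, assuming a 4KB PAGE_SIZE and the
default MAX_PAGE_ORDER of 10, i.e. a 4MB maximum allocation (both values
are architecture and config dependent): for an 8MB buffer,
get_order(bufsize) returns 11. With SMCR_PHYS_CONT_BUFS the order is
clamped to 10, so alloc_pages() is attempted once at the largest valid
order instead of failing outright; with SMCR_MIXED_BUFS the uncapped
order 11 fails the "<= MAX_PAGE_ORDER" check, so the doomed alloc_pages()
call is skipped and the code falls through directly to vzalloc().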
* Re: [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers

From: Alexandra Winter @ 2026-01-23 10:54 UTC
To: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On 23.01.26 09:23, D. Wythe wrote:

Describe your changes in imperative mood.
https://docs.kernel.org/process/submitting-patches.html#describe-your-changes

> For SMCR_PHYS_CONT_BUFS, the allocation order is now capped to
> MAX_PAGE_ORDER, ensures the attempts to allocate the largest possible
> physically contiguous chunk instead of failing with an invalid order,
> which also avoid redundant "try-fail-degrade" cycles in __smc_buf_create().
>
> For SMCR_MIXED_BUFS, If it's order exceeds MAX_PAGE_ORDER, skips the
> doomed physical allocation attempt and fallback to virtual memory
> immediately.
>

Proposal for a version in imperative mood (iiuc):
"
For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER. This
ensures the attempts to allocate the largest possible physically contiguous
chunk succeed, instead of failing with an invalid order. This also avoids
redundant "try-fail-degrade" cycles in __smc_buf_create().

For SMCR_MIXED_BUFS, if its order exceeds MAX_PAGE_ORDER, skip the doomed
physical allocation attempt and fallback to virtual memory immediately.
"

> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> Reviewed-by: Dust Li <dust.li@linux.alibaba.com>

Other than that: LGTM
Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>
* Re: [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers

From: D. Wythe @ 2026-01-24 9:22 UTC
To: Alexandra Winter
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 11:54:37AM +0100, Alexandra Winter wrote:
> On 23.01.26 09:23, D. Wythe wrote:
> Describe your changes in imperative mood.
> https://docs.kernel.org/process/submitting-patches.html#describe-your-changes
>
> [...]
>
> Proposal for a version in imperative mood (iiuc):
> "
> For SMCR_PHYS_CONT_BUFS, cap the allocation order to MAX_PAGE_ORDER. This
> ensures the attempts to allocate the largest possible physically contiguous
> chunk succeed, instead of failing with an invalid order. This also avoids
> redundant "try-fail-degrade" cycles in __smc_buf_create().
>
> For SMCR_MIXED_BUFS, if its order exceeds MAX_PAGE_ORDER, skip the doomed
> physical allocation attempt and fallback to virtual memory immediately.
> "
>
> Other than that: LGTM
> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com>

Hi Alexandra,

Thank you for your review and for providing the refined description.
I will use your suggested wording for the commit message in V2.

Best regards,
D. Wythe
* [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-23 8:23 UTC
To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang
Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

find_vm_area() provides a way to find the vm_struct associated with a
virtual address. Export this symbol to modules so that modularized
subsystems can perform lookups on vmalloc addresses.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 mm/vmalloc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index ecbac900c35f..3eb9fe761c34 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
 
 	return va->vm;
 }
+EXPORT_SYMBOL_GPL(find_vm_area);
 
 /**
  * remove_vm_area - find and remove a continuous kernel virtual area
--
2.45.0
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Christoph Hellwig @ 2026-01-23 14:44 UTC
To: D. Wythe
Cc: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> find_vm_area() provides a way to find the vm_struct associated with a
> virtual address. Export this symbol to modules so that modularized
> subsystems can perform lookups on vmalloc addresses.

No, they have absolutely no business doing that. This functionality is
very intentionally kept private.
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-23 18:55 UTC
To: D. Wythe
Cc: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> find_vm_area() provides a way to find the vm_struct associated with a
> virtual address. Export this symbol to modules so that modularized
> subsystems can perform lookups on vmalloc addresses.
>
> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
> ---
>  mm/vmalloc.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index ecbac900c35f..3eb9fe761c34 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr)
>
>  	return va->vm;
>  }
> +EXPORT_SYMBOL_GPL(find_vm_area);
>
This is internal. We can not just export it.

--
Uladzislau Rezki
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-24 9:35 UTC
To: Uladzislau Rezki
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
    Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote:
> On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote:
> > find_vm_area() provides a way to find the vm_struct associated with a
> > virtual address. Export this symbol to modules so that modularized
> > subsystems can perform lookups on vmalloc addresses.
> >
> > [...]
> > +EXPORT_SYMBOL_GPL(find_vm_area);
> >
> This is internal. We can not just export it.
>
> --
> Uladzislau Rezki

Hi Uladzislau,

Thank you for the feedback. I agree that we should avoid exposing
internal implementation details like struct vm_struct to external
subsystems.

Following Christoph's suggestion, I'm planning to encapsulate the page
order lookup into a minimal helper instead:

unsigned int vmalloc_page_order(const void *addr)
{
	struct vm_struct *vm;

	vm = find_vm_area(addr);
	return vm ? vm->page_order : 0;
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);

Does this approach look reasonable to you? It would keep the vm_struct
layout private while satisfying the optimization needs of SMC.

Thanks,
D. Wythe
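
To show where such a helper would plug in on the SMC side, here is a
rough sketch modeled on the existing smc_ib_map_mr_sg() in
net/smc/smc_ib.c; the page_size computation is the new part, and the
exact placement is illustrative rather than the final patch:

static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx)
{
	unsigned int page_size = PAGE_SIZE;
	unsigned int offset = 0;

	/* for a vmalloc'ed buffer backed by high-order pages, each
	 * physically contiguous chunk spans PAGE_SIZE << page_order
	 * bytes, so one MTT entry can cover a whole chunk instead of
	 * one entry per 4KB page
	 */
	if (buf_slot->is_vm)
		page_size <<= vmalloc_page_order(buf_slot->cpu_addr);

	return ib_map_mr_sg(buf_slot->mr[link_idx],
			    buf_slot->sgt[link_idx].sgl,
			    buf_slot->sgt[link_idx].orig_nents,
			    &offset, page_size);
}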
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-24 10:48 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

Hello, D. Wythe!

> Hi Uladzislau,
>
> Thank you for the feedback. I agree that we should avoid exposing
> internal implementation details like struct vm_struct to external
> subsystems.
>
> Following Christoph's suggestion, I'm planning to encapsulate the page
> order lookup into a minimal helper instead:
>
> unsigned int vmalloc_page_order(const void *addr)
> {
> 	struct vm_struct *vm;
>
> 	vm = find_vm_area(addr);
> 	return vm ? vm->page_order : 0;
> }
> EXPORT_SYMBOL_GPL(vmalloc_page_order);
>
> Does this approach look reasonable to you? It would keep the vm_struct
> layout private while satisfying the optimization needs of SMC.
>
Could you please clarify why you need info about page_order? I have not
looked at your second patch.

Thanks!

--
Uladzislau Rezki
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-24 14:57 UTC
To: Uladzislau Rezki
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
    Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote:
> Hello, D. Wythe!
>
> [...]
> Could you please clarify why you need info about page_order? I have not
> looked at your second patch.
>
> Thanks!
>
> --
> Uladzislau Rezki

Hi Uladzislau,

This stems from optimizing memory registration in SMC-R. To provide the
RDMA hardware with direct access to memory buffers, we must register
them with the NIC. During this process, the hardware generates one MTT
entry for each physically contiguous block. Since these hardware entries
are a finite and scarce resource, and SMC currently defaults to a 4KB
registration granularity, a single 2MB buffer consumes 512 entries. In
high-concurrency scenarios, this inefficiency quickly exhausts NIC
resources and becomes a major bottleneck for system scalability.

To address this, we intend to use vmalloc_huge(). When it successfully
allocates high-order pages, the vmalloc area is backed by a sequence of
physically contiguous chunks (e.g., 2MB each). If we know this
page_order, we can register these larger physical blocks instead of
individual 4KB pages, reducing MTT consumption from 512 entries down to
1 for every 2MB of memory (with page_order == 9).

However, the result of vmalloc_huge() is currently opaque to the caller.
We cannot determine whether it successfully allocated huge pages or fell
back to 4KB pages based solely on the returned pointer. Therefore, we
need a helper function to query the actual page order, enabling SMC-R to
adapt its registration logic to the underlying physical layout.

I hope this clarifies our design motivation!

Best regards,
D. Wythe
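
For illustration, the allocation side of that plan could look like the
following sketch (it assumes the vmalloc_page_order() helper proposed
earlier in this thread; the page_order field on the buffer descriptor is
hypothetical):

	/* prefer huge mappings; vmalloc_huge() transparently falls
	 * back to order-0 (4KB) pages when high-order allocation fails
	 */
	buf_desc->cpu_addr = vmalloc_huge(bufsize, GFP_KERNEL | __GFP_ZERO);
	if (!buf_desc->cpu_addr)
		goto out;
	buf_desc->is_vm = true;
	/* discover what we actually got: 0 means plain 4KB pages */
	buf_desc->page_order = vmalloc_page_order(buf_desc->cpu_addr);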
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-26 10:28 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

Hello, D. Wythe!

> This stems from optimizing memory registration in SMC-R. To provide the
> RDMA hardware with direct access to memory buffers, we must register
> them with the NIC. During this process, the hardware generates one MTT
> entry for each physically contiguous block. Since these hardware entries
> are a finite and scarce resource, and SMC currently defaults to a 4KB
> registration granularity, a single 2MB buffer consumes 512 entries. In
> high-concurrency scenarios, this inefficiency quickly exhausts NIC
> resources and becomes a major bottleneck for system scalability.
>
> To address this, we intend to use vmalloc_huge(). When it successfully
> allocates high-order pages, the vmalloc area is backed by a sequence of
> physically contiguous chunks (e.g., 2MB each). If we know this
> page_order, we can register these larger physical blocks instead of
> individual 4KB pages, reducing MTT consumption from 512 entries down to
> 1 for every 2MB of memory (with page_order == 9).
>
> However, the result of vmalloc_huge() is currently opaque to the caller.
> We cannot determine whether it successfully allocated huge pages or fell
> back to 4KB pages based solely on the returned pointer. Therefore, we
> need a helper function to query the actual page order, enabling SMC-R to
> adapt its registration logic to the underlying physical layout.
>
> I hope this clarifies our design motivation!
>
Appreciate for the explanation. Yes it clarifies an intention.

As for proposed patch above:

- A page_order is available if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
- It makes sense to get a node, grab a spin-lock and find VM, save
  page_order and release the lock.

You can have a look at the vmalloc_dump_obj(void *object) function.
We try-spinlock there whereas you need just spin-lock. But the idea
is the same.

--
Uladzislau Rezki
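
Translating that guidance into code, the v2 helper would presumably look
something like this sketch, modeled on vmalloc_dump_obj(); addr_to_node(),
__find_vmap_area() and vm_area_page_order() are mm/vmalloc.c internals,
so treat the details as approximate:

unsigned int vmalloc_page_order(const void *addr)
{
	struct vmap_node *vn;
	struct vmap_area *va;
	unsigned int order = 0;

	vn = addr_to_node((unsigned long)addr);

	spin_lock(&vn->busy.lock);
	va = __find_vmap_area((unsigned long)addr, &vn->busy.root);
	/* vm_area_page_order() returns 0 when
	 * CONFIG_HAVE_ARCH_HUGE_VMALLOC is not defined
	 */
	if (va && va->vm)
		order = vm_area_page_order(va->vm);
	spin_unlock(&vn->busy.lock);

	return order;
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);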
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-26 12:02 UTC
To: Uladzislau Rezki
Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet,
    Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang,
    Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel,
    linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Mon, Jan 26, 2026 at 11:28:46AM +0100, Uladzislau Rezki wrote:
> [...]
> As for proposed patch above:
>
> - A page_order is available if CONFIG_HAVE_ARCH_HUGE_VMALLOC is defined;
> - It makes sense to get a node, grab a spin-lock and find VM, save
>   page_order and release the lock.
>
> You can have a look at the vmalloc_dump_obj(void *object) function.
> We try-spinlock there whereas you need just spin-lock. But the idea
> is the same.
>
> --
> Uladzislau Rezki

Hi Uladzislau,

Thanks very much for the detailed guidance, especially on the correct
locking pattern. This is extremely helpful. I will follow it and send
a v2 patch series with the new helper implemented in mm/vmalloc.c.

Thanks again for your support.

Best regards,
D. Wythe
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Uladzislau Rezki @ 2026-01-26 16:45 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Mon, Jan 26, 2026 at 08:02:26PM +0800, D. Wythe wrote:
> [...]
> Thanks very much for the detailed guidance, especially on the correct
> locking pattern. This is extremely helpful. I will follow it and send
> a v2 patch series with the new helper implemented in mm/vmalloc.c.
>
> Thanks again for your support.
>
Welcome!

--
Uladzislau Rezki
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Leon Romanovsky @ 2026-01-27 13:34 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote:
> [...]
> This stems from optimizing memory registration in SMC-R. To provide the
> RDMA hardware with direct access to memory buffers, we must register
> them with the NIC. During this process, the hardware generates one MTT
> entry for each physically contiguous block. Since these hardware entries
> are a finite and scarce resource, and SMC currently defaults to a 4KB
> registration granularity, a single 2MB buffer consumes 512 entries. In
> high-concurrency scenarios, this inefficiency quickly exhausts NIC
> resources and becomes a major bottleneck for system scalability.

I believe this complexity can be avoided by using the RDMA MR pool API,
as other ULPs do, for example NVMe.

Thanks

> To address this, we intend to use vmalloc_huge(). When it successfully
> allocates high-order pages, the vmalloc area is backed by a sequence of
> physically contiguous chunks (e.g., 2MB each). If we know this
> page_order, we can register these larger physical blocks instead of
> individual 4KB pages, reducing MTT consumption from 512 entries down to
> 1 for every 2MB of memory (with page_order == 9).
>
> [...]
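
For reference, the MR pool API Leon points to lives in
drivers/infiniband/core/mr_pool.c and is managed per QP; the typical ULP
pattern (a sketch with declarations, sizing and error handling elided)
looks like:

	#include <rdma/mr_pool.h>

	/* setup: pre-allocate nr fast-registration MRs on the QP */
	ret = ib_mr_pool_init(qp, &qp->rdma_mrs, nr,
			      IB_MR_TYPE_MEM_REG, max_num_sg);

	/* per operation: borrow an MR, ib_map_mr_sg() + IB_WR_REG_MR,
	 * then hand it back once the registration is no longer needed
	 */
	mr = ib_mr_pool_get(qp, &qp->rdma_mrs);
	...
	ib_mr_pool_put(qp, &qp->rdma_mrs, mr);

	/* teardown */
	ib_mr_pool_destroy(qp, &qp->rdma_mrs);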
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-28 3:45 UTC
To: Leon Romanovsky
Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote:
> [...]
> I believe this complexity can be avoided by using the RDMA MR pool API,
> as other ULPs do, for example NVMe.
>
> Thanks

Hi Leon,

Am I correct in assuming you are suggesting mr_pool to limit the number
of MRs as a way to cap MTTE consumption?

However, our goal is to maximize the total registered memory within
the MTTE limits rather than to cap it. In SMC-R, each connection
occupies a configurable, fixed-size registered buffer; consequently,
the more memory we can register, the more concurrent connections
we can support.

By leveraging vmalloc_huge() and the proposed helper to increase the
page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
physical block. This significantly reduces the total number of entries
required to map the same amount of memory, allowing us to serve more
connections under the same hardware constraints.

D. Wythe
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Leon Romanovsky @ 2026-01-28 11:13 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote:
> [...]
> Am I correct in assuming you are suggesting mr_pool to limit the number
> of MRs as a way to cap MTTE consumption?

I don't see this as a limit, but as something that is considered
standard practice to reduce MTT consumption.

> However, our goal is to maximize the total registered memory within
> the MTTE limits rather than to cap it. In SMC-R, each connection
> occupies a configurable, fixed-size registered buffer; consequently,
> the more memory we can register, the more concurrent connections
> we can support.

It is not a cap, but a more efficient use of existing resources.

> By leveraging vmalloc_huge() and the proposed helper to increase the
> page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous
> physical block. This significantly reduces the total number of entries
> required to map the same amount of memory, allowing us to serve more
> connections under the same hardware constraints.
>
> D. Wythe
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: D. Wythe @ 2026-01-28 12:44 UTC
To: Leon Romanovsky
Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote:
> [...]
> I don't see this as a limit, but as something that is considered
> standard practice to reduce MTT consumption.
>
> > However, our goal is to maximize the total registered memory within
> > the MTTE limits rather than to cap it. In SMC-R, each connection
> > occupies a configurable, fixed-size registered buffer; consequently,
> > the more memory we can register, the more concurrent connections
> > we can support.
>
> It is not a cap, but a more efficient use of existing resources.

Got it. While MR pools may be the more standard practice, they don't
address our specific bottleneck. In fact, SMC already has its own
internal MR reuse; our core issue remains reducing MTTE consumption by
increasing the registration granularity to maximize the memory size
mapped per MTT entry.
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area()

From: Leon Romanovsky @ 2026-01-28 13:49 UTC
To: D. Wythe
Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li,
    Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond,
    Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu,
    linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang

On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote:
> [...]
> Got it. While MR pools may be the more standard practice, they don't
> address our specific bottleneck. In fact, SMC already has its own
> internal MR reuse; our core issue remains reducing MTTE consumption by
> increasing the registration granularity to maximize the memory size
> mapped per MTT entry.

And this is something MR pools can handle as well. We are going in circles,
so let's summarize.

I see SMC-R as one of the RDMA ULPs, and it should ideally rely on the
existing ULP API used by NVMe, NFS, and others, rather than maintaining its
own internal logic.

I also do not know whether vmalloc_page_order() is an appropriate solution;
I only want to show that we can probably achieve the same result without
introducing a new function.

Thanks
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-28 13:49 ` Leon Romanovsky @ 2026-01-29 11:03 ` D. Wythe 2026-01-29 12:22 ` Leon Romanovsky 0 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-29 11:03 UTC (permalink / raw) To: Leon Romanovsky Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote: > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote: > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote: > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote: > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote: > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote: > > > > > > > Hello, D. Wythe! > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote: > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote: > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a > > > > > > > > > > virtual address. Export this symbol to modules so that modularized > > > > > > > > > > subsystems can perform lookups on vmalloc addresses. > > > > > > > > > > > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > > > > > > > > > > --- > > > > > > > > > > mm/vmalloc.c | 1 + > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644 > > > > > > > > > > --- a/mm/vmalloc.c > > > > > > > > > > +++ b/mm/vmalloc.c > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr) > > > > > > > > > > > > > > > > > > > > return va->vm; > > > > > > > > > > } > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area); > > > > > > > > > > > > > > > > > > > This is internal. We can not just export it. > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing > > > > > > > > internal implementation details like struct vm_struct to external > > > > > > > > subsystems. > > > > > > > > > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page > > > > > > > > order lookup into a minimal helper instead: > > > > > > > > > > > > > > > > unsigned int vmalloc_page_order(const void *addr){ > > > > > > > > struct vm_struct *vm; > > > > > > > > vm = find_vm_area(addr); > > > > > > > > return vm ? vm->page_order : 0; > > > > > > > > } > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order); > > > > > > > > > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct > > > > > > > > layout private while satisfying the optimization needs of SMC. > > > > > > > > > > > > > > > Could you please clarify why you need info about page_order? I have not > > > > > > > looked at your second patch. > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > -- > > > > > > > Uladzislau Rezki > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > This stems from optimizing memory registration in SMC-R. 
To provide the > > > > > > RDMA hardware with direct access to memory buffers, we must register > > > > > > them with the NIC. During this process, the hardware generates one MTT > > > > > > entry for each physically contiguous block. Since these hardware entries > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC > > > > > > resources and becomes a major bottleneck for system scalability. > > > > > > > > > > I believe this complexity can be avoided by using the RDMA MR pool API, > > > > > as other ULPs do, for example NVMe. > > > > > > > > > > Thanks > > > > > > > > > > > > > Hi Leon, > > > > > > > > Am I correct in assuming you are suggesting mr_pool to limit the number > > > > of MRs as a way to cap MTTE consumption? > > > > > > I don't see this a limit, but something that is considered standard > > > practice to reduce MTT consumption. > > > > > > > > > > > However, our goal is to maximize the total registered memory within > > > > the MTTE limits rather than to cap it. In SMC-R, each connection > > > > occupies a configurable, fixed-size registered buffer; consequently, > > > > the more memory we can register, the more concurrent connections > > > > we can support. > > > > > > It is not cap, but more efficient use of existing resources. > > > > Got it. While MRs pool might be more standard practice, but it doesn't > > address our specific bottleneck. In fact, smc already has its own internal > > MR reuse; our core issue remains reducing MTTE consumption by increasing the > > registration granularity to maximize the memory size mapped per MTT entry. > > And this is something MR pools can handle as well. We are going in circles, > so let's summarize. I believe some points need to be thoroughly clarified here: > > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the > existing ULP API used by NVMe, NFS, and others, rather than maintaining its > own internal logic. SMC is not opposed to adopting newer RDMA interfaces; in fact, I have already planned a gradual migration to the updated RDMA APIs. We are currently in the process of adapting to ib_cqe, for instance. As long as functionality remains intact, there is no reason to oppose changes that reduce maintenance overhead or provide additional gains, but such a transition takes time. > > I also do not know whether vmalloc_page_order() is an appropriate solution; > I only want to show that we can probably achieve the same result without > introducing a new function. Regarding the specific issue under discussion, I believe the newer RDMA APIs you mentioned do not solve my problem, at least for now. My understanding is that regardless of how MRs are pooled, the core requirement is to increase the page_size parameter in ib_map_mr_sg to maximize the physical size mapped per MTTE. From the code I have examined, I see no evidence of these new APIs utilizing values other than 4KB. Of course, I believe that regardless of whether this issue currently exists, it is something the RDMA community can resolve. However, as I mentioned, adapting to new API takes time. Before a complete transition is achieved, we need to allow for some necessary updates to SMC. Thanks ^ permalink raw reply [flat|nested] 30+ messages in thread
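To make the page_size point concrete: it is the final argument of ib_map_mr_sg(), and it sets how much physical memory each translation entry covers. A minimal sketch, assuming a buffer already described by an sg_table and an MR sized for a single segment; the demo_* name is hypothetical:

#include <linux/sizes.h>
#include <rdma/ib_verbs.h>

/* With PAGE_SIZE as the last argument, a physically contiguous 2MB
 * buffer costs 512 MTT entries; with SZ_2M it costs one, provided
 * every scatterlist segment starts and ends on a 2MB boundary.
 */
static int demo_map_mr_2m(struct ib_mr *mr, struct sg_table *sgt)
{
	unsigned int offset = 0;

	return ib_map_mr_sg(mr, sgt->sgl, sgt->orig_nents, &offset, SZ_2M);
}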
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 11:03 ` D. Wythe @ 2026-01-29 12:22 ` Leon Romanovsky 2026-01-29 14:04 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Leon Romanovsky @ 2026-01-29 12:22 UTC (permalink / raw) To: D. Wythe Cc: Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 07:03:23PM +0800, D. Wythe wrote: > On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote: > > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote: > > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote: > > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote: > > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote: > > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote: > > > > > > > > Hello, D. Wythe! > > > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote: > > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote: > > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a > > > > > > > > > > > virtual address. Export this symbol to modules so that modularized > > > > > > > > > > > subsystems can perform lookups on vmalloc addresses. > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > > > > > > > > > > > --- > > > > > > > > > > > mm/vmalloc.c | 1 + > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644 > > > > > > > > > > > --- a/mm/vmalloc.c > > > > > > > > > > > +++ b/mm/vmalloc.c > > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr) > > > > > > > > > > > > > > > > > > > > > > return va->vm; > > > > > > > > > > > } > > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area); > > > > > > > > > > > > > > > > > > > > > This is internal. We can not just export it. > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing > > > > > > > > > internal implementation details like struct vm_struct to external > > > > > > > > > subsystems. > > > > > > > > > > > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page > > > > > > > > > order lookup into a minimal helper instead: > > > > > > > > > > > > > > > > > > unsigned int vmalloc_page_order(const void *addr){ > > > > > > > > > struct vm_struct *vm; > > > > > > > > > vm = find_vm_area(addr); > > > > > > > > > return vm ? vm->page_order : 0; > > > > > > > > > } > > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order); > > > > > > > > > > > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct > > > > > > > > > layout private while satisfying the optimization needs of SMC. > > > > > > > > > > > > > > > > > Could you please clarify why you need info about page_order? I have not > > > > > > > > looked at your second patch. > > > > > > > > > > > > > > > > Thanks! 
> > > > > > > > > > > > > > > > -- > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > This stems from optimizing memory registration in SMC-R. To provide the > > > > > > > RDMA hardware with direct access to memory buffers, we must register > > > > > > > them with the NIC. During this process, the hardware generates one MTT > > > > > > > entry for each physically contiguous block. Since these hardware entries > > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB > > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In > > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC > > > > > > > resources and becomes a major bottleneck for system scalability. > > > > > > > > > > > > I believe this complexity can be avoided by using the RDMA MR pool API, > > > > > > as other ULPs do, for example NVMe. > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > Hi Leon, > > > > > > > > > > Am I correct in assuming you are suggesting mr_pool to limit the number > > > > > of MRs as a way to cap MTTE consumption? > > > > > > > > I don't see this a limit, but something that is considered standard > > > > practice to reduce MTT consumption. > > > > > > > > > > > > > > However, our goal is to maximize the total registered memory within > > > > > the MTTE limits rather than to cap it. In SMC-R, each connection > > > > > occupies a configurable, fixed-size registered buffer; consequently, > > > > > the more memory we can register, the more concurrent connections > > > > > we can support. > > > > > > > > It is not cap, but more efficient use of existing resources. > > > > > > Got it. While MRs pool might be more standard practice, but it doesn't > > > address our specific bottleneck. In fact, smc already has its own internal > > > MR reuse; our core issue remains reducing MTTE consumption by increasing the > > > registration granularity to maximize the memory size mapped per MTT entry. > > > > And this is something MR pools can handle as well. We are going in circles, > > so let's summarize. > > I believe some points need to be thoroughly clarified here: > > > > > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the > > existing ULP API used by NVMe, NFS, and others, rather than maintaining its > > own internal logic. > > SMC is not opposed to adopting newer RDMA interfaces; in fact, I have > already planned a gradual migration to the updated RDMA APIs. We are > currently in the process of adapting to ib_cqe, for instance. As long as > functionality remains intact, there is no reason to oppose changes that > reduce maintenance overhead or provide additional gains, but such a > transition takes time. > > > > > I also do not know whether vmalloc_page_order() is an appropriate solution; > > I only want to show that we can probably achieve the same result without > > introducing a new function. > > Regarding the specific issue under discussion, I believe the newer RDMA > APIs you mentioned do not solve my problem, at least for now. My > understanding is that regardless of how MRs are pooled, the core > requirement is to increase the page_size parameter in ib_map_mr_sg to > maximize the physical size mapped per MTTE. From the code I have > examined, I see no evidence of these new APIs utilizing values other > than 4KB. > > Of course, I believe that regardless of whether this issue > currently exists, it is something the RDMA community can resolve. 
> However, as I mentioned, adapting to new API takes time. Before a > complete transition is achieved, we need to allow for some necessary > updates to SMC. I disagree with that statement. SMC‑R has a long history of re‑implementing existing RDMA ULP APIs, and not always correctly. https://lore.kernel.org/netdev/20170510072627.12060-1-hch@lst.de/ https://lore.kernel.org/netdev/20241105112313.GE311159@unreal/#t Thanks > > Thanks > ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 12:22 ` Leon Romanovsky @ 2026-01-29 14:04 ` D. Wythe 0 siblings, 0 replies; 30+ messages in thread From: D. Wythe @ 2026-01-29 14:04 UTC (permalink / raw) To: Leon Romanovsky Cc: D. Wythe, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 02:22:02PM +0200, Leon Romanovsky wrote: > On Thu, Jan 29, 2026 at 07:03:23PM +0800, D. Wythe wrote: > > On Wed, Jan 28, 2026 at 03:49:34PM +0200, Leon Romanovsky wrote: > > > On Wed, Jan 28, 2026 at 08:44:04PM +0800, D. Wythe wrote: > > > > On Wed, Jan 28, 2026 at 01:13:46PM +0200, Leon Romanovsky wrote: > > > > > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > > > > On Tue, Jan 27, 2026 at 03:34:17PM +0200, Leon Romanovsky wrote: > > > > > > > On Sat, Jan 24, 2026 at 10:57:54PM +0800, D. Wythe wrote: > > > > > > > > On Sat, Jan 24, 2026 at 11:48:59AM +0100, Uladzislau Rezki wrote: > > > > > > > > > Hello, D. Wythe! > > > > > > > > > > > > > > > > > > > On Fri, Jan 23, 2026 at 07:55:17PM +0100, Uladzislau Rezki wrote: > > > > > > > > > > > On Fri, Jan 23, 2026 at 04:23:48PM +0800, D. Wythe wrote: > > > > > > > > > > > > find_vm_area() provides a way to find the vm_struct associated with a > > > > > > > > > > > > virtual address. Export this symbol to modules so that modularized > > > > > > > > > > > > subsystems can perform lookups on vmalloc addresses. > > > > > > > > > > > > > > > > > > > > > > > > Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> > > > > > > > > > > > > --- > > > > > > > > > > > > mm/vmalloc.c | 1 + > > > > > > > > > > > > 1 file changed, 1 insertion(+) > > > > > > > > > > > > > > > > > > > > > > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c > > > > > > > > > > > > index ecbac900c35f..3eb9fe761c34 100644 > > > > > > > > > > > > --- a/mm/vmalloc.c > > > > > > > > > > > > +++ b/mm/vmalloc.c > > > > > > > > > > > > @@ -3292,6 +3292,7 @@ struct vm_struct *find_vm_area(const void *addr) > > > > > > > > > > > > > > > > > > > > > > > > return va->vm; > > > > > > > > > > > > } > > > > > > > > > > > > +EXPORT_SYMBOL_GPL(find_vm_area); > > > > > > > > > > > > > > > > > > > > > > > This is internal. We can not just export it. > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > > > > > Thank you for the feedback. I agree that we should avoid exposing > > > > > > > > > > internal implementation details like struct vm_struct to external > > > > > > > > > > subsystems. > > > > > > > > > > > > > > > > > > > > Following Christoph's suggestion, I'm planning to encapsulate the page > > > > > > > > > > order lookup into a minimal helper instead: > > > > > > > > > > > > > > > > > > > > unsigned int vmalloc_page_order(const void *addr){ > > > > > > > > > > struct vm_struct *vm; > > > > > > > > > > vm = find_vm_area(addr); > > > > > > > > > > return vm ? vm->page_order : 0; > > > > > > > > > > } > > > > > > > > > > EXPORT_SYMBOL_GPL(vmalloc_page_order); > > > > > > > > > > > > > > > > > > > > Does this approach look reasonable to you? It would keep the vm_struct > > > > > > > > > > layout private while satisfying the optimization needs of SMC. 
> > > > > > > > > > > > > > > > > > > Could you please clarify why you need info about page_order? I have not > > > > > > > > > looked at your second patch. > > > > > > > > > > > > > > > > > > Thanks! > > > > > > > > > > > > > > > > > > -- > > > > > > > > > Uladzislau Rezki > > > > > > > > > > > > > > > > Hi Uladzislau, > > > > > > > > > > > > > > > > This stems from optimizing memory registration in SMC-R. To provide the > > > > > > > > RDMA hardware with direct access to memory buffers, we must register > > > > > > > > them with the NIC. During this process, the hardware generates one MTT > > > > > > > > entry for each physically contiguous block. Since these hardware entries > > > > > > > > are a finite and scarce resource, and SMC currently defaults to a 4KB > > > > > > > > registration granularity, a single 2MB buffer consumes 512 entries. In > > > > > > > > high-concurrency scenarios, this inefficiency quickly exhausts NIC > > > > > > > > resources and becomes a major bottleneck for system scalability. > > > > > > > > > > > > > > I believe this complexity can be avoided by using the RDMA MR pool API, > > > > > > > as other ULPs do, for example NVMe. > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > Hi Leon, > > > > > > > > > > > > Am I correct in assuming you are suggesting mr_pool to limit the number > > > > > > of MRs as a way to cap MTTE consumption? > > > > > > > > > > I don't see this a limit, but something that is considered standard > > > > > practice to reduce MTT consumption. > > > > > > > > > > > > > > > > > However, our goal is to maximize the total registered memory within > > > > > > the MTTE limits rather than to cap it. In SMC-R, each connection > > > > > > occupies a configurable, fixed-size registered buffer; consequently, > > > > > > the more memory we can register, the more concurrent connections > > > > > > we can support. > > > > > > > > > > It is not cap, but more efficient use of existing resources. > > > > > > > > Got it. While MRs pool might be more standard practice, but it doesn't > > > > address our specific bottleneck. In fact, smc already has its own internal > > > > MR reuse; our core issue remains reducing MTTE consumption by increasing the > > > > registration granularity to maximize the memory size mapped per MTT entry. > > > > > > And this is something MR pools can handle as well. We are going in circles, > > > so let's summarize. > > > > I believe some points need to be thoroughly clarified here: > > > > > > > > I see SMC‑R as one of the RDMA ULPs, and it should ideally rely on the > > > existing ULP API used by NVMe, NFS, and others, rather than maintaining its > > > own internal logic. > > > > SMC is not opposed to adopting newer RDMA interfaces; in fact, I have > > already planned a gradual migration to the updated RDMA APIs. We are > > currently in the process of adapting to ib_cqe, for instance. As long as > > functionality remains intact, there is no reason to oppose changes that > > reduce maintenance overhead or provide additional gains, but such a > > transition takes time. > > > > > > > > I also do not know whether vmalloc_page_order() is an appropriate solution; > > > I only want to show that we can probably achieve the same result without > > > introducing a new function. > > > > Regarding the specific issue under discussion, I believe the newer RDMA > > APIs you mentioned do not solve my problem, at least for now. 
My > > understanding is that regardless of how MRs are pooled, the core > > requirement is to increase the page_size parameter in ib_map_mr_sg to > > maximize the physical size mapped per MTTE. From the code I have > > examined, I see no evidence of these new APIs utilizing values other > > than 4KB. > > > > Of course, I believe that regardless of whether this issue > > currently exists, it is something the RDMA community can resolve. > > However, as I mentioned, adapting to new API takes time. Before a > > complete transition is achieved, we need to allow for some necessary > > updates to SMC. > > I disagree with that statement. > > SMC‑R has a long history of re‑implementing existing RDMA ULP APIs, and > not always correctly. > > https://lore.kernel.org/netdev/20170510072627.12060-1-hch@lst.de/ > https://lore.kernel.org/netdev/20241105112313.GE311159@unreal/#t > I see that this discussion has moved beyond the technical scope of the patch into historical design critiques. I do not wish to engage in a debate over SMC's history, nor am I responsible for those past decisions. I will discontinue the conversation here. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-28 3:45 ` D. Wythe 2026-01-28 11:13 ` Leon Romanovsky @ 2026-01-28 18:06 ` Jason Gunthorpe 2026-01-29 11:36 ` D. Wythe 1 sibling, 1 reply; 30+ messages in thread From: Jason Gunthorpe @ 2026-01-28 18:06 UTC (permalink / raw) To: D. Wythe Cc: Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > By leveraging vmalloc_huge() and the proposed helper to increase the > page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous > physical block. This doesn't seem right. If your goal is to take a vmalloc() pointer and convert it to an MR via a scatterlist and ib_map_mr_sg(), then you should be asking for a helper to convert a kernel pointer into a scatterlist. Even if you do this in a naive way and call the sg_alloc_append_table_from_pages() function, it will automatically join physically contiguous ranges together for you. From there you can check the resulting scatterlist and compute the page_size to pass to ib_map_mr_sg(). No need to ask the MM for anything other than the list of physicals to build the scatterlist with. Still, I wouldn't mind seeing a helper to convert a kernel pointer into a scatterlist, because I have seen that open-coded in a few places, and maybe there are ways to optimize that using more information from the MM - but those should be APIs used only by this helper, not exposed to drivers. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
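A rough sketch of the naive route Jason describes: collect the pages behind a vmalloc'ed buffer and let sg_alloc_append_table_from_pages() merge physically adjacent ranges. vmalloc_to_page() and the append-table API are existing kernel functions; the helper name and the choice to leave DMA mapping to the caller are assumptions of this sketch:

#include <linux/limits.h>
#include <linux/mm.h>
#include <linux/scatterlist.h>
#include <linux/slab.h>
#include <linux/vmalloc.h>

/* Build a scatterlist for a vmalloc'ed buffer of nr_pages base pages.
 * 'append' must be zero-initialized; on success the caller uses
 * append->sgt and later releases it with sg_free_append_table().
 */
static int demo_vmalloc_to_sgt(void *buf, unsigned int nr_pages,
			       struct sg_append_table *append)
{
	struct page **pages;
	unsigned int i;
	int ret;

	pages = kvmalloc_array(nr_pages, sizeof(*pages), GFP_KERNEL);
	if (!pages)
		return -ENOMEM;

	for (i = 0; i < nr_pages; i++)
		pages[i] = vmalloc_to_page(buf + i * PAGE_SIZE);

	/* physically adjacent pages are coalesced into single segments */
	ret = sg_alloc_append_table_from_pages(append, pages, nr_pages, 0,
					       (unsigned long)nr_pages * PAGE_SIZE,
					       UINT_MAX, 0, GFP_KERNEL);
	kvfree(pages);
	return ret;
}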
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-28 18:06 ` Jason Gunthorpe @ 2026-01-29 11:36 ` D. Wythe 2026-01-29 13:20 ` Jason Gunthorpe 0 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-29 11:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: D. Wythe, Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Wed, Jan 28, 2026 at 02:06:29PM -0400, Jason Gunthorpe wrote: > On Wed, Jan 28, 2026 at 11:45:58AM +0800, D. Wythe wrote: > > > By leveraging vmalloc_huge() and the proposed helper to increase the > > page_size in ib_map_mr_sg(), each MTTE covers a much larger contiguous > > physical block. > > This doesn't seem right. If your goal is to take a vmalloc() pointer > and convert it to an MR via a scatterlist and ib_map_mr_sg(), then you > should be asking for a helper to convert a kernel pointer into a > scatterlist. > > Even if you do this in a naive way and call the > sg_alloc_append_table_from_pages() function, it will automatically join > physically contiguous ranges together for you. > > From there you can check the resulting scatterlist and compute the > page_size to pass to ib_map_mr_sg(). > > No need to ask the MM for anything other than the list of physicals to > build the scatterlist with. > > Still, I wouldn't mind seeing a helper to convert a kernel pointer > into a scatterlist, because I have seen that open-coded in a few places, > and maybe there are ways to optimize that using more information from > the MM - but those should be APIs used only by this helper, not exposed to > drivers. > > Jason Hi Jason, To be honest, I was previously unaware of the sg_alloc_append_table_from_pages() function, although I had indeed considered manually calculating the size of contiguous physical blocks. The reason I proposed the MM helper is that SMC is not a driver; it utilizes vmalloc() for memory allocation and is thus in direct contact with the MM. From this perspective, having the MM provide the page_order would be the most straightforward approach. Given the significant opposition and our plans to transition SMC to newer APIs in the future anyway, I agree that introducing this helper now is less justified. I will follow your suggestion and update the next version accordingly. Thanks. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 11:36 ` D. Wythe @ 2026-01-29 13:20 ` Jason Gunthorpe 2026-01-30 8:51 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Jason Gunthorpe @ 2026-01-29 13:20 UTC (permalink / raw) To: D. Wythe Cc: Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > From there you can check the resulting scatterlist and compute the > > > page_size to pass to ib_map_mr_sg(). I should clarify that this is done after DMA mapping the scatterlist; DMA mapping can improve the page size. And maybe the core code should be helping compute the MR's target page size for a scatterlist. We already have code to do this in umem, and it is pretty tricky considering the IOVA-related rules. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
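For illustration, a much-simplified version of that computation over a DMA-mapped scatterlist. It enforces only the interior alignment constraint and deliberately ignores the IOVA-offset rules and the device's supported page-size bitmap that ib_umem_find_best_pgsz() must honour, which is exactly the tricky part Jason refers to:

#include <linux/bitops.h>
#include <linux/log2.h>
#include <linux/scatterlist.h>

/* Largest power-of-two page size usable to register a DMA-mapped
 * scatterlist: every segment boundary interior to the MR must fall on
 * a page-size multiple. Simplified sketch; real code must also apply
 * the IOVA offset rules and mask against the device's capabilities.
 */
static unsigned long demo_best_page_size(struct scatterlist *sgl, int nents)
{
	struct scatterlist *sg;
	unsigned long mask = 0;
	int i;

	for_each_sg(sgl, sg, nents, i) {
		if (i)			/* all but the first must start aligned */
			mask |= sg_dma_address(sg);
		if (i != nents - 1)	/* all but the last must end aligned */
			mask |= sg_dma_address(sg) + sg_dma_len(sg);
	}
	if (!mask)	/* single segment: one entry can cover it all */
		return roundup_pow_of_two(sg_dma_len(sgl));
	return 1UL << __ffs(mask);	/* lowest set bit bounds the alignment */
}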
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-29 13:20 ` Jason Gunthorpe @ 2026-01-30 8:51 ` D. Wythe 2026-01-30 15:16 ` Jason Gunthorpe 0 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-30 8:51 UTC (permalink / raw) To: Jason Gunthorpe Cc: D. Wythe, Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote: > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > > From there you can check the resulting scatterlist and compute the > > > page_size to pass to ib_map_mr_sg(). > > I should clarify this is done after DMA mapping the scatterlist. dma > mapping can improve the page size. > > And maybe the core code should be helping compute the MR's target page > size for a scatterlist.. We already have code to do this in umem, and > it is a pretty bit tricky considering the IOVA related rules. > Hi Jason, After a deep dive into ib_umem_find_best_pgsz(), I have to say it is much more subtle than it first appears. The IOVA-to-PA relative offset rules, in particular, make it quite easy to get wrong. While SMC could duplicate this logic, it is certainly not ideal for maintenance. Are there any plans to refactor this into a generic RDMA core helper—for instance, one that can determine the best page size directly from an sg_table or scatterlist? Best regards, D. Wythe ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-30 8:51 ` D. Wythe @ 2026-01-30 15:16 ` Jason Gunthorpe 2026-02-03 9:14 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Jason Gunthorpe @ 2026-01-30 15:16 UTC (permalink / raw) To: D. Wythe Cc: Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 30, 2026 at 04:51:31PM +0800, D. Wythe wrote: > On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote: > > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > > > > From there you can check the resulting scatterlist and compute the > > > > page_size to pass to ib_map_mr_sg(). > > > > I should clarify this is done after DMA mapping the scatterlist. dma > > mapping can improve the page size. > > > > And maybe the core code should be helping compute the MR's target page > > size for a scatterlist.. We already have code to do this in umem, and > > it is a pretty bit tricky considering the IOVA related rules. > > > > Hi Jason, > > After a deep dive into ib_umem_find_best_pgsz(), I have to say it is > much more subtle than it first appears. The IOVA-to-PA relative offset > rules, in particular, make it quite easy to get wrong. > > While SMC could duplicate this logic, it is certainly not ideal for > maintenance. Are there any plans to refactor this into a generic RDMA > core helper—for instance, one that can determine the best page size > directly from an sg_table or scatterlist? I have not heard of anyone touching this. It looks like there are only two users in the kernel that pass something other than PAGE_SIZE, so it seems nobody has cared about this till now. With high order folios being more common it seems like something missing. However, I wonder what the drivers do with the input page size, segmenting a scatterlist is a bit hard and we have helpers for that already too. It is a bigger project but probably the right thing is to remove the page size input, wrap the scatterlist in a umem and fixup the drivers to use the existing umem support for building mtts, splitting scatterlists into blocks and so on. The kernel side here has been left alone for a long time.. Jason ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() 2026-01-30 15:16 ` Jason Gunthorpe @ 2026-02-03 9:14 ` D. Wythe 0 siblings, 0 replies; 30+ messages in thread From: D. Wythe @ 2026-02-03 9:14 UTC (permalink / raw) To: Jason Gunthorpe Cc: D. Wythe, Leon Romanovsky, Uladzislau Rezki, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 30, 2026 at 11:16:36AM -0400, Jason Gunthorpe wrote: > On Fri, Jan 30, 2026 at 04:51:31PM +0800, D. Wythe wrote: > > On Thu, Jan 29, 2026 at 09:20:58AM -0400, Jason Gunthorpe wrote: > > > On Thu, Jan 29, 2026 at 07:36:09PM +0800, D. Wythe wrote: > > > > > > > > From there you can check the resulting scatterlist and compute the > > > > > page_size to pass to ib_map_mr_sg(). > > > > > > I should clarify this is done after DMA mapping the scatterlist. dma > > > mapping can improve the page size. > > > > > > And maybe the core code should be helping compute the MR's target page > > > size for a scatterlist.. We already have code to do this in umem, and > > > it is a pretty bit tricky considering the IOVA related rules. > > > > > > > Hi Jason, > > > > After a deep dive into ib_umem_find_best_pgsz(), I have to say it is > > much more subtle than it first appears. The IOVA-to-PA relative offset > > rules, in particular, make it quite easy to get wrong. > > > > While SMC could duplicate this logic, it is certainly not ideal for > > maintenance. Are there any plans to refactor this into a generic RDMA > > core helper—for instance, one that can determine the best page size > > directly from an sg_table or scatterlist? > > I have not heard of anyone touching this. > > It looks like there are only two users in the kernel that pass > something other than PAGE_SIZE, so it seems nobody has cared about > this till now. > > With high order folios being more common it seems like something > missing. > > However, I wonder what the drivers do with the input page size, > segmenting a scatterlist is a bit hard and we have helpers for that > already too. > > It is a bigger project but probably the right thing is to remove the > page size input, wrap the scatterlist in a umem and fixup the drivers > to use the existing umem support for building mtts, splitting > scatterlists into blocks and so on. > > The kernel side here has been left alone for a long time.. I am also curious about the original design intent behind requiring the caller to explicitly pass `page_size`. From what I can see, its primary role is to define the memory size per MTTE, but calculating the optimal value is surprisingly complex. I completely agree that providing an automatic way to optimize or calculate the best page size should be the responsibility of the drivers or the RDMA core themselves. Handling such low-level hardware-related details in a ULP like SMC feels misplaced. Since it appears this isn't a high-priority issue for the community at the moment, and a proper fix requires a much larger architectural effort in the RDMA core, I will withdraw this patch series. I'll keep an eye on the RDMA subsystem's progress and see if a more generic solution emerges in the future. Thanks, D. Wythe ^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers 2026-01-23 8:23 [PATCH net-next 0/3] net/smc: buffer allocation and registration improvements D. Wythe 2026-01-23 8:23 ` [PATCH net-next 1/3] net/smc: cap allocation order for SMC-R physically contiguous buffers D. Wythe 2026-01-23 8:23 ` [PATCH net-next 2/3] mm: vmalloc: export find_vm_area() D. Wythe @ 2026-01-23 8:23 ` D. Wythe 2026-01-23 14:52 ` Christoph Hellwig 2 siblings, 1 reply; 30+ messages in thread From: D. Wythe @ 2026-01-23 8:23 UTC (permalink / raw) To: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki, Wenjia Zhang Cc: Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang SMC-R buffers currently use 4KB page mapping for IB registration. Each page consumes one MTTE, which is inefficient and quickly depletes limited IB hardware resources for large buffers. For virtually contiguous buffers, switch to vmalloc_huge() to leverage huge page support. By using larger page sizes during IB MR registration, we can drastically reduce MTTE consumption. For physically contiguous buffers, the entire buffer now requires only a single MTTE. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> --- net/smc/smc_core.c | 3 ++- net/smc/smc_ib.c | 23 ++++++++++++++++++++--- 2 files changed, 22 insertions(+), 4 deletions(-) diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c index 6219db498976..8aca5dc54be7 100644 --- a/net/smc/smc_core.c +++ b/net/smc/smc_core.c @@ -2348,7 +2348,8 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr, goto out; fallthrough; // try virtually contiguous buf case SMCR_VIRT_CONT_BUFS: - buf_desc->cpu_addr = vzalloc(PAGE_SIZE << buf_desc->order); + buf_desc->cpu_addr = vmalloc_huge(PAGE_SIZE << buf_desc->order, + GFP_KERNEL | __GFP_ZERO); if (!buf_desc->cpu_addr) goto out; buf_desc->pages = NULL; diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c index 1154907c5c05..67211d44a1db 100644 --- a/net/smc/smc_ib.c +++ b/net/smc/smc_ib.c @@ -20,6 +20,7 @@ #include <linux/wait.h> #include <linux/mutex.h> #include <linux/inetdevice.h> +#include <linux/vmalloc.h> #include <rdma/ib_verbs.h> #include <rdma/ib_cache.h> @@ -697,6 +698,18 @@ void smc_ib_put_memory_region(struct ib_mr *mr) ib_dereg_mr(mr); } +static inline int smc_buf_get_vm_page_order(struct smc_buf_desc *buf_slot) +{ +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC + struct vm_struct *vm; + + vm = find_vm_area(buf_slot->cpu_addr); + return vm ? vm->page_order : 0; +#else + return 0; +#endif +} + static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx) { unsigned int offset = 0; @@ -706,8 +719,9 @@ static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx) sg_num = ib_map_mr_sg(buf_slot->mr[link_idx], buf_slot->sgt[link_idx].sgl, buf_slot->sgt[link_idx].orig_nents, - &offset, PAGE_SIZE); - + &offset, + buf_slot->is_vm ? PAGE_SIZE << smc_buf_get_vm_page_order(buf_slot) : + PAGE_SIZE << buf_slot->order); return sg_num; } @@ -719,7 +733,10 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags, return 0; /* already done */ buf_slot->mr[link_idx] = - ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order); + ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, + buf_slot->is_vm ? 
+ 1 << (buf_slot->order - smc_buf_get_vm_page_order(buf_slot)) : 1); + if (IS_ERR(buf_slot->mr[link_idx])) { int rc; -- 2.45.0 ^ permalink raw reply [flat|nested] 30+ messages in thread
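For a concrete sense of the savings this patch targets, take a 2MB buffer, i.e. buf_desc->order = 9 with 4KB pages; the figures follow from the code above and the 512-entry example quoted earlier in the thread:

before the patch: page_size = PAGE_SIZE (4KB), so 2MB / 4KB = 512 MTT entries per buffer
physically contiguous buffer after the patch: page_size = PAGE_SIZE << 9 = 2MB and max_num_sg = 1, so 1 entry
vmalloc_huge() buffer backed by 2MB pages (page_order = 9): max_num_sg = 1 << (9 - 9) = 1, so 1 entry
vmalloc_huge() falling back to 4KB pages (page_order = 0): unchanged, 512 entries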
* Re: [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers 2026-01-23 8:23 ` [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers D. Wythe @ 2026-01-23 14:52 ` Christoph Hellwig 2026-01-24 9:25 ` D. Wythe 0 siblings, 1 reply; 30+ messages in thread From: Christoph Hellwig @ 2026-01-23 14:52 UTC (permalink / raw) To: D. Wythe Cc: David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 23, 2026 at 04:23:49PM +0800, D. Wythe wrote: > +static inline int smc_buf_get_vm_page_order(struct smc_buf_desc *buf_slot) > +{ > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC > + struct vm_struct *vm; > + > + vm = find_vm_area(buf_slot->cpu_addr); > + return vm ? vm->page_order : 0; > +#else > + return 0; > +#endif You might want to encapsulate this logic in a vmalloc_order or similar helper in vmalloc.c. ^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH net-next 3/3] net/smc: optimize MTTE consumption for SMC-R buffers 2026-01-23 14:52 ` Christoph Hellwig @ 2026-01-24 9:25 ` D. Wythe 0 siblings, 0 replies; 30+ messages in thread From: D. Wythe @ 2026-01-24 9:25 UTC (permalink / raw) To: Christoph Hellwig Cc: D. Wythe, David S. Miller, Andrew Morton, Dust Li, Eric Dumazet, Jakub Kicinski, Paolo Abeni, Sidraya Jayagond, Uladzislau Rezki, Wenjia Zhang, Mahanta Jambigi, Simon Horman, Tony Lu, Wen Gu, linux-kernel, linux-mm, linux-rdma, linux-s390, netdev, oliver.yang On Fri, Jan 23, 2026 at 06:52:55AM -0800, Christoph Hellwig wrote: > On Fri, Jan 23, 2026 at 04:23:49PM +0800, D. Wythe wrote: > > +static inline int smc_buf_get_vm_page_order(struct smc_buf_desc *buf_slot) > > +{ > > +#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC > > + struct vm_struct *vm; > > + > > + vm = find_vm_area(buf_slot->cpu_addr); > > + return vm ? vm->page_order : 0; > > +#else > > + return 0; > > +#endif > > You might want to encapsulate this logic in a vmalloc_order or similar > helper in vmalloc.c. Hi Christoph, That's a great suggestion. Encapsulating this logic into a helper like vmalloc_page_order() (or similar) within vmalloc.c is indeed much cleaner than exporting find_vm_area(). I'll implement this helper in V2 and use it in the SMC code. Thanks for pointing this out! Thanks, D. Wythe ^ permalink raw reply [flat|nested] 30+ messages in thread
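One plausible shape for that V2 helper in mm/vmalloc.c, folding in the CONFIG_HAVE_ARCH_HUGE_VMALLOC guard from patch 3/3; this is a sketch of the stated plan, not a posted patch:

/* Return the page order of the vmalloc mapping backing @addr, or 0 for
 * base pages, non-vmalloc addresses, and kernels without huge vmalloc.
 */
unsigned int vmalloc_page_order(const void *addr)
{
#ifdef CONFIG_HAVE_ARCH_HUGE_VMALLOC
	struct vm_struct *vm = find_vm_area(addr);

	return vm ? vm->page_order : 0;
#else
	return 0;
#endif
}
EXPORT_SYMBOL_GPL(vmalloc_page_order);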