linux-mm.kvack.org archive mirror
* Report: Performance regression from ib_umem_get on zone device pages
@ 2025-04-23 19:21 jane.chu
  2025-04-23 19:34 ` Resend: " jane.chu
  2025-04-23 23:28 ` Jason Gunthorpe
  0 siblings, 2 replies; 10+ messages in thread
From: jane.chu @ 2025-04-23 19:21 UTC (permalink / raw)
  To: logane, hch, gregkh, jgg, willy, kch, axboe, linux-kernel,
	linux-mm, linux-pci, linux-nvme, linux-block
  Cc: jane.chu

Hi,

I recently looked into an mr cache registration regression that shows up
with device-dax backed mr memory but not with system RAM backed mr memory.

It boils down to
   1567b49d1a40 lib/scatterlist: add check when merging zone device pages
   [PATCH v11 5/9] lib/scatterlist: add check when merging zone device pages
   https://lore.kernel.org/all/20221021174116.7200-6-logang@deltatee.com/

that went into v6.2-rc1.

The line that introduced the regression is at the bottom of this call chain:
   ib_uverbs_reg_mr
     mlx5_ib_reg_user_mr
       ib_umem_get
         sg_alloc_append_table_from_pages
           pages_are_mergeable
             zone_device_pages_have_same_pgmap(a,b)
               return a->pgmap == b->pgmap               <-------

Sub "return a->pgmap == b->pgmap" with "return true" purely as an 
experiment and the regression reliably went away.
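
For context, after that commit the merge check in lib/scatterlist.c looks
roughly like this (paraphrased from memory, see the lore link above for the
exact patch), and the experiment simply stubbed out the helper:

   static bool pages_are_mergeable(struct page *a, struct page *b)
   {
           if (page_to_pfn(a) != page_to_pfn(b) + 1)
                   return false;
           /* added by 1567b49d1a40: zone device pages must share a pgmap */
           if (!zone_device_pages_have_same_pgmap(a, b))
                   return false;
           return true;
   }

   /* the experiment: short-circuit the pgmap comparison in the helper */
   static inline bool zone_device_pages_have_same_pgmap(const struct page *a,
                                                        const struct page *b)
   {
           return true;
   }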

So this looks like a case of CPU cache thrashing, but I don't know how to
fix it. Could someone help address the issue?  I'd be happy to help with
verification.

My test system is a two-socket bare metal Intel(R) Xeon(R) Platinum
8352Y with 12 Intel NVDIMMs installed.

# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Model name:          Intel(R) Xeon(R) Platinum 8352Y CPU @ 2.20GHz
L1d cache:           48K        <----
L1i cache:           32K
L2 cache:            1280K
L3 cache:            49152K
NUMA node0 CPU(s):   0-31,64-95
NUMA node1 CPU(s):   32-63,96-127

# cat /proc/meminfo
MemTotal:       263744088 kB
MemFree:        252151828 kB
MemAvailable:   251806008 kB

There are 12 device-dax instances configured exactly the same -
# ndctl list -m devdax | egrep -m 1 'map'
     "map":"mem",
# ndctl list -m devdax | egrep -c 'map'
12
# ndctl list -m devdax
[
   {
     "dev":"namespace1.0",
     "mode":"devdax",
     "map":"mem",
     "size":135289372672,
     "uuid":"a67deda8-e5b3-4a6e-bea2-c1ebdc0fd996",
     "chardev":"dax1.0",
     "align":2097152
   },
[..]

The system is idle except when running the mr registration test. The test
attempts to register 61440 mrs from 64 threads in parallel; each mr is 2MB
and is backed by device-dax memory.

The flow of a single test run:
   1. reserve virtual address space for (61440 * 2MB) via mmap with
PROT_NONE and MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE
   2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax instances
to the reserved virtual address space sequentially to form a contiguous
VA space
   3. touch the entire mapped memory page by page
   4. take a timestamp,
      create 40 pthreads, each thread registers (61440 / 40) mrs via
ibv_reg_mr(),
      take another timestamp after pthread_join
   5. wait 10 seconds
   6. repeat step 4, but for deregistration via ibv_dereg_mr()
   7. tear down everything

I hope the above description is helpful as I am not at liberty to share 
the test code.
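
Purely as an illustration of the flow above, not the actual test code
(daxpath[], pd, mr[], first and last are placeholders; error handling,
threading and step 3 are omitted):

   #include <fcntl.h>
   #include <sys/mman.h>
   #include <infiniband/verbs.h>

   #define MR_SIZE (2UL << 20)                  /* each mr is 2MB        */
   #define NR_MRS  61440UL                      /* total mrs registered  */
   #define NR_DAX  12                           /* device-dax instances  */

   static void one_pass(struct ibv_pd *pd, const char **daxpath,
                        struct ibv_mr **mr, unsigned long first,
                        unsigned long last)
   {
           size_t chunk = NR_MRS / NR_DAX * MR_SIZE;

           /* step 1: reserve one contiguous VA range for all the mrs */
           char *base = mmap(NULL, NR_MRS * MR_SIZE, PROT_NONE,
                             MAP_ANONYMOUS | MAP_NORESERVE | MAP_PRIVATE,
                             -1, 0);

           /* step 2: overlay 1/12th of the range from each device-dax */
           for (int i = 0; i < NR_DAX; i++) {
                   int fd = open(daxpath[i], O_RDWR);

                   mmap(base + i * chunk, chunk, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_FIXED, fd, 0);
           }

           /* step 4: this thread registers its share of the mrs */
           for (unsigned long m = first; m < last; m++)
                   mr[m] = ibv_reg_mr(pd, base + m * MR_SIZE, MR_SIZE,
                                      IBV_ACCESS_LOCAL_WRITE);
   }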

Here are the highlights from perf diff comparing the culprit (PATCH 5/9)
against the baseline (PATCH 4/9).

baseline = 49580e690755 block: add check when merging zone device pages
culprit  = 1567b49d1a40 lib/scatterlist: add check when merging zone device pages

# Baseline  Delta Abs  Shared Object              Symbol
# ........  .........  .........................  ....................................
#
     26.53%    -19.46%  [kernel.kallsyms]          [k] follow_page_mask
     49.15%    +11.56%  [kernel.kallsyms]          [k] native_queued_spin_lock_slowpath
                +1.38%  [kernel.kallsyms]          [k] pages_are_mergeable       <----
                +0.82%  [kernel.kallsyms]          [k] __rdma_block_iter_next
      0.74%     +0.68%  [kernel.kallsyms]          [k] osq_lock
                +0.56%  [kernel.kallsyms]          [k] mlx5r_umr_update_mr_pas
      2.25%     +0.49%  [kernel.kallsyms]          [k] follow_pmd_mask.isra.0
      1.92%     +0.37%  [kernel.kallsyms]          [k] _raw_spin_lock
      1.13%     +0.35%  [kernel.kallsyms]          [k] __get_user_pages

With the baseline, each mr registration takes ~2950 nanoseconds +- 50ns;
with the culprit, each mr registration takes ~6850 nanoseconds +- 50ns.

Regards,
-jane



* Resend: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-23 19:21 Report: Performance regression from ib_umem_get on zone device pages jane.chu
@ 2025-04-23 19:34 ` jane.chu
  2025-04-23 23:28 ` Jason Gunthorpe
  1 sibling, 0 replies; 10+ messages in thread
From: jane.chu @ 2025-04-23 19:34 UTC (permalink / raw)
  To: logang, hch, gregkh, jgg, willy, kch, axboe, linux-kernel,
	linux-mm, linux-pci, linux-nvme, linux-block

Resend due to a serious typo.

On 4/23/2025 12:21 PM, jane.chu@oracle.com wrote:
> [..]

* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-23 19:21 Report: Performance regression from ib_umem_get on zone device pages jane.chu
  2025-04-23 19:34 ` Resend: " jane.chu
@ 2025-04-23 23:28 ` Jason Gunthorpe
  2025-04-24  1:49   ` jane.chu
                     ` (2 more replies)
  1 sibling, 3 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2025-04-23 23:28 UTC (permalink / raw)
  To: jane.chu
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block

On Wed, Apr 23, 2025 at 12:21:15PM -0700, jane.chu@oracle.com wrote:

> So this looks like a case of CPU cache thrashing, but I don't know how to
> fix it. Could someone help address the issue?  I'd be happy to help with
> verification.

I don't know that we can even really fix it if that is the cause.. But
it seems suspect, if you are only doing 2M at a time per CPU core then
that is only 512 struct pages or 32k of data. The GUP process will
have touched all of that if device-dax is not creating folios. So why
did it fall out of the cache?

If it is creating folios then maybe we can improve things by
recovering the folios before adding the pages.

Or is something weird going on like the device-dax is using 1G folios
and all of these pins and checks are sharing and bouncing the same
struct page cache lines?

Can the device-dax implement memfd_pin_folios()?
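
For reference, the interface I have in mind is roughly the following
(quoting from memory, the exact signature may differ by kernel version):

   long memfd_pin_folios(struct file *memfd, loff_t start, loff_t end,
                         struct folio **folios, unsigned int max_folios,
                         pgoff_t *offset);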

> The flow of a single test run:
>   1. reserve virtual address space for (61440 * 2MB) via mmap with PROT_NONE
> and MAP_ANONYMOUS | MAP_NORESERVE| MAP_PRIVATE
>   2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax to the
> reserved virtual address space sequentially to form a contiguous VA
> space

Like is there any chance that each of these 61440 VMA's is a single
2MB folio from device-dax, or could it be?

IIRC device-dax could not use folios until 6.15, so I'm assuming
it is not folios even if it is a pmd mapping?

Jason



* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-23 23:28 ` Jason Gunthorpe
@ 2025-04-24  1:49   ` jane.chu
  2025-04-24  2:55   ` jane.chu
  2025-04-24  5:35   ` jane.chu
  2 siblings, 0 replies; 10+ messages in thread
From: jane.chu @ 2025-04-24  1:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block


On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 12:21:15PM -0700, jane.chu@oracle.com wrote:
> 
>> So this looks like a case of CPU cache thrashing, but I don't know how to
>> fix it. Could someone help address the issue?  I'd be happy to help with
>> verification.
> 
> I don't know that we can even really fix it if that is the cause.. But
> it seems suspect, if you are only doing 2M at a time per CPU core then
> that is only 512 struct pages or 32k of data. The GUP process will
> have touched all of that if device-dax is not creating folios. So why
> did it fall out of the cache?
> 
> If it is creating folios then maybe we can improve things by
> recovering the folios before adding the pages.
> 
> Or is something weird going on like the device-dax is using 1G folios
> and all of these pins and checks are sharing and bouncing the same
> struct page cache lines?

I used ndctl to create the 12 device-dax instances with the default 2M
alignment, and mmap the device-dax memory at 2M alignment and in
2M-multiple sizes, which should lead to the default 2MB hugepage mappings.
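
Roughly like this, once per region (illustrative only, the actual region
names differ on my box):

   # ndctl create-namespace --mode=devdax --map=mem --region=region1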

> 
> Can the device-dax implement memfd_pin_folios()?

Could you elaborate? Or perhaps Dan Williams could comment?

> 
>> The flow of a single test run:
>>    1. reserve virtual address space for (61440 * 2MB) via mmap with PROT_NONE
>> and MAP_ANONYMOUS | MAP_NORESERVE| MAP_PRIVATE
>>    2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax to the
>> reserved virtual address space sequentially to form a contiguous VA
>> space
> 
> Like is there any chance that each of these 61440 VMA's is a single
> 2MB folio from device-dax, or could it be?

That's 61440 mrs of 2MB each, and they come from the 12 device-dax
instances. The test process mmaps them into its pre-reserved VA range,
so the entire range is 61440 * 2MB = 122880MB, or about 31 million
4K pages.

When it comes to mr registration via ibv_reg_mr(), there will be about
31 million ->pgmap dereferences from "a->pgmap == b->pgmap"; given the
small L1 Dcache, that is how I see the cache thrashing happening.
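
(Roughly: 61440 mrs * 512 pages per 2MB mr is about 31.5 million struct
pages, and at 64 bytes per struct page that is close to 2GB of struct
page metadata, far more than any cache level on this box.)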

> 
> IIRC device-dax could not use folios until 6.15, so I'm assuming
> it is not folios even if it is a pmd mapping?

Probably not, there has been very little change to device-dax, but Dan
can correct me.

In theory, the problem could be observed with any kind of zone device
pages backing the mrs; have you seen anything like this?

thanks,
-jane

> 
> Jason
> 




* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-23 23:28 ` Jason Gunthorpe
  2025-04-24  1:49   ` jane.chu
@ 2025-04-24  2:55   ` jane.chu
  2025-04-24  3:00     ` jane.chu
  2025-04-24  5:35   ` jane.chu
  2 siblings, 1 reply; 10+ messages in thread
From: jane.chu @ 2025-04-24  2:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block


On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
> IIRC device-dax could not use folios until 6.15, so I'm assuming
> it is not folios even if it is a pmd mapping?

I just looked at 6.15-rc3; device-dax is not using folios. Maybe I'm
missing some upcoming patches?

thanks,
-jane




* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-24  2:55   ` jane.chu
@ 2025-04-24  3:00     ` jane.chu
  0 siblings, 0 replies; 10+ messages in thread
From: jane.chu @ 2025-04-24  3:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block


On 4/23/2025 7:55 PM, jane.chu@oracle.com wrote:
> 
> On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
>> IIRC device-dax could not use folios until 6.15, so I'm assuming
>> it is not folios even if it is a pmd mapping?
> 
> I just looked at 6.15-rc3; device-dax is not using folios. Maybe I'm
> missing some upcoming patches?

Oops, scratch that.  I'll test 6.15.

thanks,
-jane
> 
> thanks,
> -jane
> 
> 




* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-23 23:28 ` Jason Gunthorpe
  2025-04-24  1:49   ` jane.chu
  2025-04-24  2:55   ` jane.chu
@ 2025-04-24  5:35   ` jane.chu
  2025-04-24 12:01     ` Jason Gunthorpe
  2 siblings, 1 reply; 10+ messages in thread
From: jane.chu @ 2025-04-24  5:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block


On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
>> The flow of a single test run:
>>    1. reserve virtual address space for (61440 * 2MB) via mmap with PROT_NONE
>> and MAP_ANONYMOUS | MAP_NORESERVE| MAP_PRIVATE
>>    2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax to the
>> reserved virtual address space sequentially to form a contiguous VA
>> space
> Like is there any chance that each of these 61440 VMA's is a single
> 2MB folio from device-dax, or could it be?
> 
> IIRC device-dax could not use folios until 6.15, so I'm assuming
> it is not folios even if it is a pmd mapping?
> 

I just ran the mr registration stress test in 6.15-rc3, much better!

What has changed?  Is it folios for device-dax?  None of the code in
ib_umem_get() has changed though; it still loops through 'npages' doing

   pinned = pin_user_pages_fast(cur_base,
         min_t(unsigned long, npages, PAGE_SIZE / sizeof(struct page *)),
         gup_flags, page_list);
   ret = sg_alloc_append_table_from_pages(&umem->sgt_append, page_list,
         pinned, 0, pinned << PAGE_SHIFT, ib_dma_max_seg_size(device),
         npages, GFP_KERNEL);

for up to 512 4K pages at a time (PAGE_SIZE / sizeof(struct page *) on
this box), and zone_device_pages_have_same_pgmap() is expected to be
called for each 4K page, showing no awareness of large folios.

thanks,
-jane





* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-24  5:35   ` jane.chu
@ 2025-04-24 12:01     ` Jason Gunthorpe
  2025-04-28 19:11       ` jane.chu
  0 siblings, 1 reply; 10+ messages in thread
From: Jason Gunthorpe @ 2025-04-24 12:01 UTC (permalink / raw)
  To: jane.chu
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block

On Wed, Apr 23, 2025 at 10:35:06PM -0700, jane.chu@oracle.com wrote:
> 
> On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
> > > The flow of a single test run:
> > >    1. reserve virtual address space for (61440 * 2MB) via mmap with PROT_NONE
> > > and MAP_ANONYMOUS | MAP_NORESERVE| MAP_PRIVATE
> > >    2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax to the
> > reserved virtual address space sequentially to form a contiguous VA
> > > space
> > Like is there any chance that each of these 61440 VMA's is a single
> > 2MB folio from device-dax, or could it be?
> > 
> > IIRC device-dax could not use folios until 6.15, so I'm assuming
> > it is not folios even if it is a pmd mapping?
> 
> I just ran the mr registration stress test in 6.15-rc3, much better!
> 
> What has changed?  Is it folios for device-dax?  None of the code in
> ib_umem_get() has changed though; it still loops through 'npages' doing

I don't know, it is kind of strange that it changed. If device-dax is
now using folios then it does change the access pattern to the struct
page array somewhat; especially, it moves all the writes to the head
page of the 2MB section, which maybe impacts the caching?

Jason



* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-24 12:01     ` Jason Gunthorpe
@ 2025-04-28 19:11       ` jane.chu
  2025-04-29 12:29         ` Jason Gunthorpe
  0 siblings, 1 reply; 10+ messages in thread
From: jane.chu @ 2025-04-28 19:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block



On 4/24/2025 5:01 AM, Jason Gunthorpe wrote:
> On Wed, Apr 23, 2025 at 10:35:06PM -0700, jane.chu@oracle.com wrote:
>>
>> On 4/23/2025 4:28 PM, Jason Gunthorpe wrote:
>>>> The flow of a single test run:
>>>>     1. reserve virtual address space for (61440 * 2MB) via mmap with PROT_NONE
>>>> and MAP_ANONYMOUS | MAP_NORESERVE| MAP_PRIVATE
>>>>     2. mmap ((61440 * 2MB) / 12) from each of the 12 device-dax to the
>>>> reserved virtual address space sequentially to form a contiguous VA
>>>> space
>>> Like is there any chance that each of these 61440 VMA's is a single
>>> 2MB folio from device-dax, or could it be?
>>>
>>> IIRC device-dax could not use folios until 6.15, so I'm assuming
>>> it is not folios even if it is a pmd mapping?
>>
>> I just ran the mr registration stress test in 6.15-rc3, much better!
>>
>> What has changed?  Is it folios for device-dax?  None of the code in
>> ib_umem_get() has changed though; it still loops through 'npages' doing
> 
> I don't know, it is kind of strange that it changed. If device-dax is
> now using folios then it does change the access pattern to the struct
> page array somewhat; especially, it moves all the writes to the head
> page of the 2MB section, which maybe impacts the caching?

6.15-rc3 is orders of magnitude better.
Agreed that device-dax's use of folios is likely the hero. I've yet to
check the code and bisect; maybe pin_user_pages_fast() adds folios to
page_list[] instead of 4K pages?  If so, with a 511/512 size reduction
in page_list[], that could drastically improve the downstream call
performance in spite of the thrashing, that is, if the thrashing is
still there.

I'll report my findings.

Thanks,
-jane

> 
> Jason




* Re: Report: Performance regression from ib_umem_get on zone device pages
  2025-04-28 19:11       ` jane.chu
@ 2025-04-29 12:29         ` Jason Gunthorpe
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Gunthorpe @ 2025-04-29 12:29 UTC (permalink / raw)
  To: jane.chu
  Cc: logane, hch, gregkh, willy, kch, axboe, linux-kernel, linux-mm,
	linux-pci, linux-nvme, linux-block

On Mon, Apr 28, 2025 at 12:11:40PM -0700, jane.chu@oracle.com wrote:

> 6.15-rc3 is orders of magnitude better.
> Agreed that device-dax's use of folios is likely the hero. I've yet to
> check the code and bisect; maybe pin_user_pages_fast() adds folios to
> page_list[] instead of 4K pages?

It does not.

I think a bisection would be interesting information

Jason



