linux-mm.kvack.org archive mirror
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
       [not found] <20241024095429.54052-1-kalyazin@amazon.com>
@ 2024-11-20 12:09 ` Nikita Kalyazin
  2024-11-20 13:46   ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 12:09 UTC (permalink / raw)
  To: pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, David Hildenbrand,
	Sean Christopherson, linux-mm

On 24/10/2024 10:54, Nikita Kalyazin wrote:
> [2] proposes an alternative to
> UserfaultFD for intercepting stage-2 faults, while this series
> conceptually complements it with the ability to populate guest memory
> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.

+David
+Sean
+mm

While measuring memory population performance of guest_memfd using this 
series, I noticed that guest_memfd population takes longer than my 
baseline, which is filling anonymous private memory via UFFDIO_COPY.
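
The baseline population loop is essentially one UFFDIO_COPY per page, along 
the lines of the sketch below (the userfaultfd setup and registration 
boilerplate is omitted, and this is only an approximation of the actual 
harness):

#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Sketch of the anon/private baseline: one UFFDIO_COPY per page.  Assumes
 * 'dst' is an anonymous MAP_PRIVATE area already registered on 'uffd' for
 * missing faults; setup boilerplate omitted. */
static void populate_via_uffdio_copy(int uffd, char *dst, const char *src,
                                     size_t len, size_t page_size)
{
        for (size_t off = 0; off < len; off += page_size) {
                struct uffdio_copy copy = {
                        .dst = (unsigned long)(dst + off),
                        .src = (unsigned long)(src + off),
                        .len = page_size,
                        .mode = 0,
                };

                if (ioctl(uffd, UFFDIO_COPY, &copy) == -1)
                        break;  /* real code would handle EAGAIN/EEXIST */
        }
}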

I am using x86_64 for my measurements and a 3 GiB memory region:
  - anon/private UFFDIO_COPY:  940 ms
  - guest_memfd:              1371 ms (+46%)

It turns out that the effect is observable not only for guest_memfd, but 
also for any type of shared memory, eg memfd or anonymous memory mapped 
as shared.

Below are measurements of a plain mmap(MAP_POPULATE) operation:

mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
     MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
  vs
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE,
     MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

Results:
  - MAP_PRIVATE: 968 ms
  - MAP_SHARED: 1646 ms
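
The numbers come from a harness along these lines (a minimal sketch of the 
assumed methodology, not the exact program used):

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

int main(int argc, char **argv)
{
        size_t len = 3ull << 30;                /* 3 GiB */
        int flags = (argc > 1) ? MAP_SHARED : MAP_PRIVATE;
        struct timespec t0, t1;
        void *p;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 flags | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        if (p == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        printf("%s: %.0f ms\n", (argc > 1) ? "MAP_SHARED" : "MAP_PRIVATE",
               (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6);
        munmap(p, len);
        return 0;
}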

I am seeing this effect on a range of kernels. The oldest I used was 
5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).

When profiling with perf, I observe the following hottest operations 
(kvm-next). Attaching full distributions at the end of the email.

MAP_PRIVATE:
- 19.72% clear_page_erms, rep stos %al,%es:(%rdi)

MAP_SHARED:
- 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is atomic 
setting of the PG_uptodate bit
- 10.98% clear_page_erms, rep stos %al,%es:(%rdi)

Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate, which 
sets the PG_uptodate bit with a regular (non-atomic) store, while 
MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the 
bit atomically.
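
On x86 that is roughly the difference between a plain store and a 
lock-prefixed read-modify-write. A toy user-space illustration of that 
per-page cost (an assumption for illustration only, not kernel code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NWORDS ((3UL << 30) / 4096)  /* one flags word per 4 KiB page of 3 GiB */

static long usecs(struct timespec a, struct timespec b)
{
        return (b.tv_sec - a.tv_sec) * 1000000L + (b.tv_nsec - a.tv_nsec) / 1000;
}

int main(void)
{
        unsigned long *flags = calloc(NWORDS, sizeof(*flags));
        struct timespec t0, t1;
        unsigned long i;

        if (!flags)
                return 1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NWORDS; i++)
                flags[i] |= 0x8;        /* plain store, ~__folio_mark_uptodate */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        __asm__ __volatile__("" ::: "memory");  /* keep the loop honest */
        printf("plain  OR: %ld us\n", usecs(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < NWORDS; i++)    /* lock-prefixed RMW, ~folio_mark_uptodate */
                __atomic_fetch_or(&flags[i], 0x8UL, __ATOMIC_SEQ_CST);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("atomic OR: %ld us\n", usecs(t0, t1));

        free(flags);
        return 0;
}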

While this logic is intuitive, its performance effect is more 
significant than I would expect.

The questions are:
  - Is this a well-known behaviour?
  - Is there a way to mitigate that, ie make shared memory (including 
guest_memfd) population faster/comparable to private memory?

Nikita


Appendix: full call tree obtained via perf

MAP_PRIVATE:

       - 87.97% __mmap
            entry_SYSCALL_64_after_hwframe
            do_syscall_64
            vm_mmap_pgoff
            __mm_populate
            populate_vma_page_range
          - __get_user_pages
             - 77.94% handle_mm_fault
                - 76.90% __handle_mm_fault
                   - 72.70% do_anonymous_page
                      - 31.92% vma_alloc_folio_noprof
                         - 30.74% alloc_pages_mpol_noprof
                            - 29.60% __alloc_pages_noprof
                               - 28.40% get_page_from_freelist
                                    19.72% clear_page_erms
                                  - 3.00% __rmqueue_pcplist
                                       __mod_zone_page_state
                                    1.18% _raw_spin_trylock
                      - 20.03% __pte_offset_map_lock
                         - 15.96% _raw_spin_lock
                              1.50% preempt_count_add
                         - 2.27% __pte_offset_map
                              __rcu_read_lock
                      - 7.22% __folio_batch_add_and_move
                         - 4.68% folio_batch_move_lru
                            - 3.77% lru_add
                               + 0.95% __mod_zone_page_state
                                 0.86% __mod_node_page_state
                           0.84% folios_put_refs
                           0.55% check_preemption_disabled
                      - 2.85% folio_add_new_anon_rmap
                         - __folio_mod_stat
                              __mod_node_page_state
                   - 1.15% pte_offset_map_nolock
                        __pte_offset_map
             - 7.59% follow_page_pte
                - 4.56% __pte_offset_map_lock
                   - 2.27% _raw_spin_lock
                        preempt_count_add
                     1.13% __pte_offset_map
                  0.75% folio_mark_accessed

MAP_SHARED:

       - 77.89% __mmap
            entry_SYSCALL_64_after_hwframe
            do_syscall_64
            vm_mmap_pgoff
            __mm_populate
            populate_vma_page_range
          - __get_user_pages
             - 72.11% handle_mm_fault
                - 71.67% __handle_mm_fault
                   - 69.62% do_fault
                      - 44.61% __do_fault
                         - shmem_fault
                            - 43.94% shmem_get_folio_gfp
                                - 17.20% shmem_alloc_and_add_folio.constprop.0
                                  - 5.10% shmem_alloc_folio
                                     - 4.58% folio_alloc_mpol_noprof
                                        - alloc_pages_mpol_noprof
                                           - 4.00% __alloc_pages_noprof
                                              - 3.31% get_page_from_freelist
                                                   1.24% __rmqueue_pcplist
                                  - 5.07% shmem_add_to_page_cache
                                     - 1.44% __mod_node_page_state
                                          0.61% check_preemption_disabled
                                       0.78% xas_store
                                       0.74% xas_find_conflict
                                       0.66% _raw_spin_lock_irq
                                  - 3.96% __folio_batch_add_and_move
                                     - 2.41% folio_batch_move_lru
                                          1.88% lru_add
                                  - 1.56% shmem_inode_acct_blocks
                                     - 1.24% __dquot_alloc_space
                                        - 0.77% inode_add_bytes
                                             _raw_spin_lock
                                  - 0.77% shmem_recalc_inode
                                       _raw_spin_lock
                                 10.98% clear_page_erms
                               - 1.17% filemap_get_entry
                                    0.78% xas_load
                      - 20.26% filemap_map_pages
                         - 12.23% next_uptodate_folio
                            - 1.27% xas_find
                                 xas_load
                         - 1.16% __pte_offset_map_lock
                              0.59% _raw_spin_lock
                      - 3.48% finish_fault
                         - 1.28% set_pte_range
                              0.96% folio_add_file_rmap_ptes
                         - 0.91% __pte_offset_map_lock
                              0.54% _raw_spin_lock
                     0.57% pte_offset_map_nolock
             - 4.11% follow_page_pte
                - 2.36% __pte_offset_map_lock
                   - 1.32% _raw_spin_lock
                        preempt_count_add
                     0.54% __pte_offset_map



* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
@ 2024-11-20 13:46   ` David Hildenbrand
  2024-11-20 15:13     ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 13:46 UTC (permalink / raw)
  To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm

On 20.11.24 13:09, Nikita Kalyazin wrote:
> On 24/10/2024 10:54, Nikita Kalyazin wrote:
>> [2] proposes an alternative to
>> UserfaultFD for intercepting stage-2 faults, while this series
>> conceptually complements it with the ability to populate guest memory
>> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
> 
> +David
> +Sean
> +mm

Hi!

> 
> While measuring memory population performance of guest_memfd using this
> series, I noticed that guest_memfd population takes longer than my
> baseline, which is filling anonymous private memory via UFFDIO_COPY.
> 
> I am using x86_64 for my measurements and 3 GiB memory region:
>    - anon/private UFFDIO_COPY:  940 ms
>    - guest_memfd:              1371 ms (+46%)
> 
> It turns out that the effect is observable not only for guest_memfd, but
> also for any type of shared memory, eg memfd or anonymous memory mapped
> as shared.
>
> Below are measurements of a plain mmap(MAP_POPULATE) operation:
>
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>    vs
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_SHARED |
> MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> 
> Results:
>    - MAP_PRIVATE: 968 ms
>    - MAP_SHARED: 1646 ms

At least here it is expected to some degree: as soon as the page cache 
is involved, map/unmap gets slower, because we are effectively 
maintaining two data structures (page tables + page cache) instead of 
only a single one (page tables).

Can you make sure that THP/large folios don't interfere in your 
experiments (e.g., madvise(MADV_NOHUGEPAGE))?
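
For instance, something along these lines (only a sketch; it uses 
MADV_POPULATE_WRITE instead of MAP_POPULATE so the hint is in place 
before the pages are populated):

#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23  /* since Linux 5.14; older libc headers may lack it */
#endif

/* Sketch: same comparison, but with THP explicitly disabled for the mapping.
 * MADV_POPULATE_WRITE is used instead of MAP_POPULATE so the MADV_NOHUGEPAGE
 * hint takes effect before the pages are populated. */
static void *map_and_populate_nothp(size_t len, int shared)
{
        int flags = (shared ? MAP_SHARED : MAP_PRIVATE) | MAP_ANONYMOUS;
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, flags, -1, 0);

        if (p == MAP_FAILED)
                return NULL;
        madvise(p, len, MADV_NOHUGEPAGE);
        madvise(p, len, MADV_POPULATE_WRITE);
        return p;
}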

> 
> I am seeing this effect on a range of kernels. The oldest I used was
> 5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
> 
> When profiling with perf, I observe the following hottest operations
> (kvm-next). Attaching full distributions at the end of the email.
> 
> MAP_PRIVATE:
> - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
> 
> MAP_SHARED:
> - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is atomic
> setting of the PG_uptodate bit
> - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)

Interesting.
> 
> Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate, which
> sets the PG_uptodate bit with a regular (non-atomic) store, while
> MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
> bit atomically.
> 
> While this logic is intuitive, its performance effect is more
> significant than I would expect.

Yes. How much of the performance difference would remain if you hack out 
the atomic op just to play with it? I suspect there will still be some 
difference.

> 
> The questions are:
>    - Is this a well-known behaviour?
>    - Is there a way to mitigate that, ie make shared memory (including
> guest_memfd) population faster/comparable to private memory?

Likely. But your experiment above measures something different from what 
guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I 
would assume guest_memfd will be faster than MAP_POPULATE.

How do you end up allocating memory for guest_memfd? Using simple 
fallocate()?

Note that we might improve allocation times with guest_memfd when 
allocating larger folios.

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 13:46   ` David Hildenbrand
@ 2024-11-20 15:13     ` David Hildenbrand
  2024-11-20 15:58       ` Nikita Kalyazin
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 15:13 UTC (permalink / raw)
  To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm

>>
>> The questions are:
>>     - Is this a well-known behaviour?
>>     - Is there a way to mitigate that, ie make shared memory (including
>> guest_memfd) population faster/comparable to private memory?
> 
> Likely. But your experiment measures above something different than what
> guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
> would assume guest_memfd will be faster than MAP_POPULATE.
> 
> How do you end up allocating memory for guest_memfd? Using simple
> fallocate()?

Heh, now I spot that your comment was a reply to a series.

If your ioctl is supposed to do more than "allocating memory" like 
MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice. 
Because for allocating memory, we would want to use fallocate() instead. 
I assume you want to "allocate+copy"?

I'll note that, as we're moving into the direction of moving 
guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*" 
ioctls, and think about something generic.

Any clue how your new ioctl will interact with the WIP to have shared 
memory as part of guest_memfd? For example, could it be reasonable to 
"populate" the shared memory first (via VMA) and then convert that 
"allocated+filled" memory to private?

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 15:13     ` David Hildenbrand
@ 2024-11-20 15:58       ` Nikita Kalyazin
  2024-11-20 16:20         ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 15:58 UTC (permalink / raw)
  To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm



On 20/11/2024 15:13, David Hildenbrand wrote:
 > Hi!

Hi! :)

 >> Results:
 >>    - MAP_PRIVATE: 968 ms
 >>    - MAP_SHARED: 1646 ms
 >
 > At least here it is expected to some degree: as soon as the page cache
 > is involved map/unmap gets slower, because we are effectively
 > maintaining two datastructures (page tables + page cache) instead of
 > only a single one (page cache)
 >
 > Can you make sure that THP/large folios don't interfere in your
 > experiments (e.g., madvise(MADV_NOHUGEPAGE))?

I was using transparent_hugepage=never command line argument in my testing.

$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Is that sufficient to exclude the THP/large folio factor?

 >> While this logic is intuitive, its performance effect is more
 >> significant that I would expect.
 >
 > Yes. How much of the performance difference would remain if you hack out
 > the atomic op just to play with it? I suspect there will still be some
 > difference.

I have tried that, but could not see any noticeable difference in the 
overall results.

It looks like a big portion of the bottleneck has moved from 
shmem_get_folio_gfp/folio_mark_uptodate to 
finish_fault/__pte_offset_map_lock somehow.  I have no good explanation 
for why:

Orig:
                   - 69.62% do_fault
                      + 44.61% __do_fault
                      + 20.26% filemap_map_pages
                      + 3.48% finish_fault
Hacked:
                   - 67.39% do_fault
                      + 32.45% __do_fault
                      + 21.87% filemap_map_pages
                      + 11.97% finish_fault

Orig:
                      - 3.48% finish_fault
                         - 1.28% set_pte_range
                              0.96% folio_add_file_rmap_ptes
                         - 0.91% __pte_offset_map_lock
                              0.54% _raw_spin_lock
Hacked:
                      - 11.97% finish_fault
                         - 8.59% __pte_offset_map_lock
                            - 6.27% _raw_spin_lock
                                 preempt_count_add
                              1.00% __pte_offset_map
                         - 1.28% set_pte_range
                            - folio_add_file_rmap_ptes
                                 __mod_node_page_state

 > Note that we might improve allocation times with guest_memfd when
 > allocating larger folios.

I suppose it may not always be an option depending on the requirements 
on the consistency of the allocation latency.  E.g. if a large folio 
isn't available at the time, the performance would degrade to the base 
case (please correct me if I'm missing something).

> Heh, now I spot that your comment was as reply to a series.

Yeah, sorry if it wasn't obvious.

> If your ioctl is supposed to do more than "allocating memory" like
> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
> Because for allocating memory, we would want to use fallocate() instead.
> I assume you want to "allocate+copy"?

Yes, the ultimate use case is "allocate+copy".

> I'll note that, as we're moving into the direction of moving
> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
> ioctls, and think about something generic.

Good point, thanks.  Are we at the stage where some concrete API has 
been proposed yet? I might have missed that.

> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?

No, I can't immediately see why it shouldn't work.  My main concern 
would probably still be about the latency of the population stage as I 
can't see why it would improve compared to what we have now, because my 
feeling is this is linked with the sharedness property of guest_memfd.

> Cheers,
> 
> David / dhildenb





* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 15:58       ` Nikita Kalyazin
@ 2024-11-20 16:20         ` David Hildenbrand
  2024-11-20 16:44           ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:20 UTC (permalink / raw)
  To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm

On 20.11.24 16:58, Nikita Kalyazin wrote:
> 
> 
> On 20/11/2024 15:13, David Hildenbrand wrote:
>   > Hi!
> 
> Hi! :)
> 
>   >> Results:
>   >>    - MAP_PRIVATE: 968 ms
>   >>    - MAP_SHARED: 1646 ms
>   >
>   > At least here it is expected to some degree: as soon as the page cache
>   > is involved map/unmap gets slower, because we are effectively
>   > maintaining two datastructures (page tables + page cache) instead of
>   > only a single one (page cache)
>   >
>   > Can you make sure that THP/large folios don't interfere in your
>   > experiments (e.g., madvise(MADV_NOHUGEPAGE))?
> 
> I was using transparent_hugepage=never command line argument in my testing.
> 
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
> 
> Is that sufficient to exclude the THP/large folio factor?

Yes!

> 
>   >> While this logic is intuitive, its performance effect is more
>   >> significant that I would expect.
>   >
>   > Yes. How much of the performance difference would remain if you hack out
>   > the atomic op just to play with it? I suspect there will still be some
>   > difference.
> 
> I have tried that, but could not see any noticeable difference in the
> overall results.
> 
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow.  I have no good explanation
> for why:

That's what I assumed. The profiling results can be rather fuzzy and 
misleading with micro-benchmarks. :(

> 
> Orig:
>                     - 69.62% do_fault
>                        + 44.61% __do_fault
>                        + 20.26% filemap_map_pages
>                        + 3.48% finish_fault
> Hacked:
>                     - 67.39% do_fault
>                        + 32.45% __do_fault
>                        + 21.87% filemap_map_pages
>                        + 11.97% finish_fault
> 
> Orig:
>                        - 3.48% finish_fault
>                           - 1.28% set_pte_range
>                                0.96% folio_add_file_rmap_ptes
>                           - 0.91% __pte_offset_map_lock
>                                0.54% _raw_spin_lock
> Hacked:
>                        - 11.97% finish_fault
>                           - 8.59% __pte_offset_map_lock
>                              - 6.27% _raw_spin_lock
>                                   preempt_count_add
>                                1.00% __pte_offset_map
>                           - 1.28% set_pte_range
>                              - folio_add_file_rmap_ptes
>                                   __mod_node_page_state
> 
>   > Note that we might improve allocation times with guest_memfd when
>   > allocating larger folios.
> 
> I suppose it may not always be an option depending on requirements to
> consistency of the allocation latency.  Eg if a large folio isn't
> available at the time, the performance would degrade to the base case
> (please correct me if I'm missing something).

Yes, there are cons to that.

> 
>> Heh, now I spot that your comment was as reply to a series.
> 
> Yeah, sorry if it wasn't obvious.
> 
>> If your ioctl is supposed to do more than "allocating memory" like
>> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
>> Because for allocating memory, we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
> 
> Yes, the ultimate use case is "allocate+copy".
> 
>> I'll note that, as we're moving into the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls, and think about something generic.
> 
> Good point, thanks.  Are we at the stage where some concrete API has
> been proposed yet? I might have missed that.

People are working on it, and we're figuring out some remaining details 
(e.g., page_type to intercept folio_put()). I assume we'll see a new 
RFC soonish (famous last words), but it's not been proposed yet.

> 
>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
> 
> No, I can't immediately see why it shouldn't work.  My main concern
> would probably still be about the latency of the population stage as I
> can't see why it would improve compared to what we have now, because my
 > feeling is this is linked with the sharedness property of guest_memfd.

If the problem is the "pagecache" overhead, then yes, it will be a 
harder nut to crack. But maybe there are some low-hanging fruits to 
optimize? Finding the main cause for the added overhead would be 
interesting.

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 16:20         ` David Hildenbrand
@ 2024-11-20 16:44           ` David Hildenbrand
  2024-11-20 17:21             ` Nikita Kalyazin
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:44 UTC (permalink / raw)
  To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm

>> No, I can't immediately see why it shouldn't work.  My main concern
>> would probably still be about the latency of the population stage as I
>> can't see why it would improve compared to what we have now, because my
>> feeling is this is linked with the sharedness property of guest_memfd.
> 
> If the problem is the "pagecache" overhead, then yes, it will be a
> harder nut to crack. But maybe there are some low-hanging fruits to
> optimize? Finding the main cause for the added overhead would be
> interesting.

Can you compare uffdio_copy() when using anonymous memory vs. shmem? 
That's likely the best we could currently achieve with guest_memfd.

There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure 
if that is of any help; it SEGFAULTS for me right now with a (likely) 
division by 0.

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 16:44           ` David Hildenbrand
@ 2024-11-20 17:21             ` Nikita Kalyazin
  2024-11-20 18:29               ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 17:21 UTC (permalink / raw)
  To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm



On 20/11/2024 16:44, David Hildenbrand wrote:
>> If the problem is the "pagecache" overhead, then yes, it will be a
>> harder nut to crack. But maybe there are some low-hanging fruits to
>> optimize? Finding the main cause for the added overhead would be
>> interesting.

Agreed, knowing the exact root cause would be really nice.

> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
> That's likely the best we could currently achieve with guest_memfd.

Yeah, I was doing that too. It was about ~28% slower in my setup, while 
with guest_memfd it was ~34% slower.  The variance of the data was quite 
high so the difference may well be just noise.  In other words, I'd be 
much happier if we could bring guest_memfd (or even shmem) performance 
closer to the anon/private than if we just equalised guest_memfd with 
shmem (which are probably already pretty close).

> There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
> if that is of any help; it SEGFAULTS for me right now with a (likely)
> division by 0.

Thanks for the pointer, will take a look!

> Cheers,
> 
> David / dhildenb
> 





* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 17:21             ` Nikita Kalyazin
@ 2024-11-20 18:29               ` David Hildenbrand
  2024-11-21 16:46                 ` Nikita Kalyazin
  0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 18:29 UTC (permalink / raw)
  To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm

On 20.11.24 18:21, Nikita Kalyazin wrote:
> 
> 
> On 20/11/2024 16:44, David Hildenbrand wrote:
>>> If the problem is the "pagecache" overhead, then yes, it will be a
>>> harder nut to crack. But maybe there are some low-hanging fruits to
>>> optimize? Finding the main cause for the added overhead would be
>>> interesting.
> 
> Agreed, knowing the exact root cause would be really nice.
> 
>> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
>> That's likely the best we could currently achieve with guest_memfd.
> 
> Yeah, I was doing that too. It was about ~28% slower in my setup, while
> with guest_memfd it was ~34% slower. 

I looked into uffdio_copy() for shmem and we still walk+modify page 
tables. In theory, we could try hacking that out: for filling the 
pagecache we would only need the vma properties, not the page table 
properties; that would then really resemble "only modify the pagecache".

That would likely resemble what we would expect with guest_memfd: work 
only on the pagecache and not the page tables. So it's rather surprising 
that guest_memfd is slower than that, as it currently doesn't mess with 
user page tables at all.

> The variance of the data was quite
> high so the difference may well be just noise.  In other words, I'd be
> much happier if we could bring guest_memfd (or even shmem) performance
> closer to the anon/private than if we just equalised guest_memfd with
> shmem (which are probably already pretty close).

Makes sense. Best we can do is:

anon: work only on page tables
shmem/guest_memfd: work only on pagecache

So at least "only one treelike structure to update".

-- 
Cheers,

David / dhildenb




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-20 18:29               ` David Hildenbrand
@ 2024-11-21 16:46                 ` Nikita Kalyazin
  2024-11-26 16:04                   ` Nikita Kalyazin
  0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-21 16:46 UTC (permalink / raw)
  To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm



On 20/11/2024 18:29, David Hildenbrand wrote:
 > Any clue how your new ioctl will interact with the WIP to have shared
 > memory as part of guest_memfd? For example, could it be reasonable to
 > "populate" the shared memory first (via VMA) and then convert that
 > "allocated+filled" memory to private?

Patrick and I synced internally on this.  What may actually work for 
guest_memfd population is the following.

Non-CoCo use case:
  - fallocate syscall to fill the page cache, no page content 
initialisation (like it is now)
  - pwrite syscall to initialise the content + mark up-to-date (mark 
prepared), no specific preparation logic is required

The pwrite will have "once" semantics until a subsequent 
fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the 
page is already prepared and return EIO/ENOSPC or something.
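
From the VMM side, that flow would look roughly like the sketch below (the 
pwrite step is the proposed behaviour described above, not something 
guest_memfd supports today, and KVM_CREATE_GUEST_MEMFD usage is abbreviated):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Sketch of the proposed non-CoCo flow.  The pwrite() step is the behaviour
 * proposed above, NOT what upstream guest_memfd implements today;
 * KVM_CREATE_GUEST_MEMFD is assumed to be available (kvm-next).  Error
 * handling and short writes are omitted. */
static int populate_guest_memfd(int vm_fd, const void *image, size_t size)
{
        struct kvm_create_guest_memfd gmem = { .size = size };
        int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

        if (gmem_fd < 0)
                return -1;
        /* Optional: allocate the pages up front, no content initialisation. */
        fallocate(gmem_fd, 0, 0, size);
        /* Proposed: initialise content and mark the pages up-to-date;
         * "once" semantics until FALLOC_FL_PUNCH_HOLE. */
        if (pwrite(gmem_fd, image, size, 0) != (ssize_t)size)
                return -1;
        return gmem_fd;
}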

SEV-SNP use case (no changes):
  - fallocate as above
  - KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare

We don't think fallocate/pwrite have dependencies on current->mm 
assumptions that Paolo mentioned in [1], so they should be safe to be 
called on guest_memfd from a non-VMM process.

[1]: 
https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c

 > Makes sense. Best we can do is:
 >
 > anon: work only on page tables
 > shmem/guest_memfd: work only on pagecache
 >
 > So at least "only one treelike structure to update".

This seems to hold with the above reasoning.

 > --
> Cheers,
> 
> David / dhildenb 




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-21 16:46                 ` Nikita Kalyazin
@ 2024-11-26 16:04                   ` Nikita Kalyazin
  2024-11-28 12:11                     ` David Hildenbrand
  0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-26 16:04 UTC (permalink / raw)
  To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm



On 21/11/2024 16:46, Nikita Kalyazin wrote:
> 
> 
> On 20/11/2024 18:29, David Hildenbrand wrote:
>  > Any clue how your new ioctl will interact with the WIP to have shared
>  > memory as part of guest_memfd? For example, could it be reasonable to
>  > "populate" the shared memory first (via VMA) and then convert that
>  > "allocated+filled" memory to private?
> 
> Patrick and I synced internally on this.  What may actually work for 
> guest_memfd population is the following.
> 
> Non-CoCo use case:
>   - fallocate syscall to fill the page cache, no page content 
> initialisation (like it is now)
>   - pwrite syscall to initialise the content + mark up-to-date (mark 
> prepared), no specific preparation logic is required
> 
> The pwrite will have "once" semantics until a subsequent 
> fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the 
> page is already prepared and return EIO/ENOSPC or something.

I prototyped that to see if it was possible (and it was).  Actually the 
write syscall can also do the allocation part, so no prior fallocate 
would be required.  The only caveat is that there is a cap on how much 
IO can be done in a single call (MAX_RW_COUNT) [1], but it doesn't look 
like a significant problem.  Does that sound like an acceptable solution?

[1]: https://elixir.bootlin.com/linux/v6.12.1/source/fs/read_write.c#L507
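
To cover a region larger than MAX_RW_COUNT, the caller would simply loop, 
for example (a sketch, assuming write()/pwrite() on guest_memfd behaves as 
in the prototype):

#include <unistd.h>

/* Sketch: populate a region larger than MAX_RW_COUNT (just under 2 GiB) by
 * looping pwrite(), assuming pwrite() on guest_memfd behaves as in the
 * prototype above. */
static ssize_t pwrite_all(int fd, const char *buf, size_t len, off_t off)
{
        size_t done = 0;

        while (done < len) {
                ssize_t n = pwrite(fd, buf + done, len - done, off + done);

                if (n <= 0)
                        return -1;      /* e.g. the "already prepared" error case */
                done += n;
        }
        return (ssize_t)done;
}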

> 
> SEV-SNP use case (no changes):
>   - fallocate as above
>   - KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare
> 
> We don't think fallocate/pwrite have dependencies on current->mm 
> assumptions that Paolo mentioned in [1], so they should be safe to be 
> called on guest_memfd from a non-VMM process.
> 
> [1]: https://lore.kernel.org/kvm/20241024095429.54052-1- 
> kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c
> 
>  > Makes sense. Best we can do is:
>  >
>  > anon: work only on page tables
>  > shmem/guest_memfd: work only on pagecache
>  >
>  > So at least "only one treelike structure to update".
> 
> This seems to hold with the above reasoning.
> 
>  > --
>> Cheers,
>>
>> David / dhildenb 
> 




* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
  2024-11-26 16:04                   ` Nikita Kalyazin
@ 2024-11-28 12:11                     ` David Hildenbrand
  0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2024-11-28 12:11 UTC (permalink / raw)
  To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
  Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
	derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm

On 26.11.24 17:04, Nikita Kalyazin wrote:
> 
> 
> On 21/11/2024 16:46, Nikita Kalyazin wrote:
>>
>>
>> On 20/11/2024 18:29, David Hildenbrand wrote:
>>   > Any clue how your new ioctl will interact with the WIP to have shared
>>   > memory as part of guest_memfd? For example, could it be reasonable to
>>   > "populate" the shared memory first (via VMA) and then convert that
>>   > "allocated+filled" memory to private?
>>
>> Patrick and I synced internally on this.  What may actually work for
>> guest_memfd population is the following.
>>
>> Non-CoCo use case:
>>    - fallocate syscall to fill the page cache, no page content
>> initialisation (like it is now)
>>    - pwrite syscall to initialise the content + mark up-to-date (mark
>> prepared), no specific preparation logic is required
>>
>> The pwrite will have "once" semantics until a subsequent
>> fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
>> page is already prepared and return EIO/ENOSPC or something.
> 
> I prototyped that to see if it was possible (and it was).  Actually the
> write syscall can also do the allocation part, so no prior fallocate
> would be required. 

Right

> The only thing is there is a cap on how much IO can
> be done in a single call (MAX_RW_COUNT) [1], but it doesn't look like a
> significant problem.  Does it sound like an acceptable solution?

Does sound quite clean to me. Of course, one thing to figure out is how 
to enable this only for that special VM type, but that should be 
possible to resolve.

-- 
Cheers,

David / dhildenb




Thread overview: 11+ messages
     [not found] <20241024095429.54052-1-kalyazin@amazon.com>
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-11-20 13:46   ` David Hildenbrand
2024-11-20 15:13     ` David Hildenbrand
2024-11-20 15:58       ` Nikita Kalyazin
2024-11-20 16:20         ` David Hildenbrand
2024-11-20 16:44           ` David Hildenbrand
2024-11-20 17:21             ` Nikita Kalyazin
2024-11-20 18:29               ` David Hildenbrand
2024-11-21 16:46                 ` Nikita Kalyazin
2024-11-26 16:04                   ` Nikita Kalyazin
2024-11-28 12:11                     ` David Hildenbrand
