* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
[not found] <20241024095429.54052-1-kalyazin@amazon.com>
@ 2024-11-20 12:09 ` Nikita Kalyazin
2024-11-20 13:46 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 12:09 UTC (permalink / raw)
To: pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, David Hildenbrand,
Sean Christopherson, linux-mm
On 24/10/2024 10:54, Nikita Kalyazin wrote:
> [2] proposes an alternative to
> UserfaultFD for intercepting stage-2 faults, while this series
conceptually complements it with the ability to populate guest memory
> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
+David
+Sean
+mm
While measuring memory population performance of guest_memfd using this
series, I noticed that guest_memfd population takes longer than my
baseline, which is filling anonymous private memory via UFFDIO_COPY.
I am using x86_64 for my measurements and a 3 GiB memory region:
- anon/private UFFDIO_COPY: 940 ms
- guest_memfd: 1371 ms (+46%)
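For reference, the UFFDIO_COPY baseline populates the region with a loop
roughly like the following (a simplified, self-contained sketch rather than
the exact code I used for the numbers above; the 0xaa payload and the
page-at-a-time granularity are just placeholders):

#include <err.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	size_t len = 3ull << 30, pg = sysconf(_SC_PAGESIZE);
	int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);
	struct uffdio_api api = { .api = UFFD_API };
	char *dst, *src;

	if (uffd < 0 || ioctl(uffd, UFFDIO_API, &api))
		err(1, "userfaultfd setup");

	dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	src = mmap(NULL, pg, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (dst == MAP_FAILED || src == MAP_FAILED)
		err(1, "mmap");
	memset(src, 0xaa, pg);

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = len },
		.mode = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg))
		err(1, "UFFDIO_REGISTER");

	/* Populate every page of the anon/private mapping via UFFDIO_COPY. */
	for (size_t off = 0; off < len; off += pg) {
		struct uffdio_copy copy = {
			.dst = (unsigned long)dst + off,
			.src = (unsigned long)src,
			.len = pg,
		};
		if (ioctl(uffd, UFFDIO_COPY, &copy))
			err(1, "UFFDIO_COPY");
	}
	return 0;
}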
It turns out that the effect is observable not only for guest_memfd, but
also for any type of shared memory, eg memfd or anonymous memory mapped
as shared.
Below are measurements of a plain mmap(MAP_POPULATE) operation:
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_PRIVATE |
MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
vs
mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_SHARED |
MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
Results:
- MAP_PRIVATE: 968 ms
- MAP_SHARED: 1646 ms
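For completeness, a standalone reproducer along these lines (a sketch, not
the exact harness used for the numbers above):

#include <err.h>
#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

static double populate_ms(int vis_flag)
{
	size_t len = 3ull << 30;
	struct timespec a, b;
	void *p;

	clock_gettime(CLOCK_MONOTONIC, &a);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 vis_flag | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	clock_gettime(CLOCK_MONOTONIC, &b);
	if (p == MAP_FAILED)
		err(1, "mmap");
	munmap(p, len);
	return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
	printf("MAP_PRIVATE: %.0f ms\n", populate_ms(MAP_PRIVATE));
	printf("MAP_SHARED:  %.0f ms\n", populate_ms(MAP_SHARED));
	return 0;
}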
I am seeing this effect on a range of kernels. The oldest I used was
5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
When profiling with perf, I observe the following hottest operations
(kvm-next). Attaching full distributions at the end of the email.
MAP_PRIVATE:
- 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
MAP_SHARED:
- 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is atomic
setting of the PG_uptodate bit
- 10.98% clear_page_erms, rep stos %al,%es:(%rdi)
Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate, which
sets the PG_uptodate bit non-atomically, while
MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
PG_uptodate bit atomically.
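The two helpers look roughly like this (paraphrased from
include/linux/page-flags.h; the exact code differs between kernel versions,
so treat this as an illustration of the plain store vs. the lock-prefixed
RMW rather than a verbatim copy):

/* Non-atomic variant: plain store, used while no one else can see the folio. */
static __always_inline void __folio_mark_uptodate(struct folio *folio)
{
	smp_wmb();	/* make the page contents visible first */
	__set_bit(PG_uptodate, folio_flags(folio, 0));
}

/* Atomic variant: lock-prefixed RMW on x86 (the "lock orb" in the profile). */
static __always_inline void folio_mark_uptodate(struct folio *folio)
{
	smp_wmb();
	set_bit(PG_uptodate, folio_flags(folio, 0));
}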
While this logic is intuitive, its performance effect is more
significant than I would expect.
The questions are:
- Is this a well-known behaviour?
- Is there a way to mitigate that, ie make shared memory (including
guest_memfd) population faster/comparable to private memory?
Nikita
Appendix: full call tree obtained via perf
MAP_PRIVATE:
- 87.97% __mmap
entry_SYSCALL_64_after_hwframe
do_syscall_64
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
- __get_user_pages
- 77.94% handle_mm_fault
- 76.90% __handle_mm_fault
- 72.70% do_anonymous_page
- 31.92% vma_alloc_folio_noprof
- 30.74% alloc_pages_mpol_noprof
- 29.60% __alloc_pages_noprof
- 28.40% get_page_from_freelist
19.72% clear_page_erms
- 3.00% __rmqueue_pcplist
__mod_zone_page_state
1.18% _raw_spin_trylock
- 20.03% __pte_offset_map_lock
- 15.96% _raw_spin_lock
1.50% preempt_count_add
- 2.27% __pte_offset_map
__rcu_read_lock
- 7.22% __folio_batch_add_and_move
- 4.68% folio_batch_move_lru
- 3.77% lru_add
+ 0.95% __mod_zone_page_state
0.86% __mod_node_page_state
0.84% folios_put_refs
0.55% check_preemption_disabled
- 2.85% folio_add_new_anon_rmap
- __folio_mod_stat
__mod_node_page_state
- 1.15% pte_offset_map_nolock
__pte_offset_map
- 7.59% follow_page_pte
- 4.56% __pte_offset_map_lock
- 2.27% _raw_spin_lock
preempt_count_add
1.13% __pte_offset_map
0.75% folio_mark_accessed
MAP_SHARED:
- 77.89% __mmap
entry_SYSCALL_64_after_hwframe
do_syscall_64
vm_mmap_pgoff
__mm_populate
populate_vma_page_range
- __get_user_pages
- 72.11% handle_mm_fault
- 71.67% __handle_mm_fault
- 69.62% do_fault
- 44.61% __do_fault
- shmem_fault
- 43.94% shmem_get_folio_gfp
- 17.20% shmem_alloc_and_add_folio.constprop.0
- 5.10% shmem_alloc_folio
- 4.58% folio_alloc_mpol_noprof
- alloc_pages_mpol_noprof
- 4.00% __alloc_pages_noprof
- 3.31% get_page_from_freelist
1.24% __rmqueue_pcplist
- 5.07% shmem_add_to_page_cache
- 1.44% __mod_node_page_state
0.61% check_preemption_disabled
0.78% xas_store
0.74% xas_find_conflict
0.66% _raw_spin_lock_irq
- 3.96% __folio_batch_add_and_move
- 2.41% folio_batch_move_lru
1.88% lru_add
- 1.56% shmem_inode_acct_blocks
- 1.24% __dquot_alloc_space
- 0.77% inode_add_bytes
_raw_spin_lock
- 0.77% shmem_recalc_inode
_raw_spin_lock
10.98% clear_page_erms
- 1.17% filemap_get_entry
0.78% xas_load
- 20.26% filemap_map_pages
- 12.23% next_uptodate_folio
- 1.27% xas_find
xas_load
- 1.16% __pte_offset_map_lock
0.59% _raw_spin_lock
- 3.48% finish_fault
- 1.28% set_pte_range
0.96% folio_add_file_rmap_ptes
- 0.91% __pte_offset_map_lock
0.54% _raw_spin_lock
0.57% pte_offset_map_nolock
- 4.11% follow_page_pte
- 2.36% __pte_offset_map_lock
- 1.32% _raw_spin_lock
preempt_count_add
0.54% __pte_offset_map
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
@ 2024-11-20 13:46 ` David Hildenbrand
2024-11-20 15:13 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 13:46 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20.11.24 13:09, Nikita Kalyazin wrote:
> On 24/10/2024 10:54, Nikita Kalyazin wrote:
>> [2] proposes an alternative to
>> UserfaultFD for intercepting stage-2 faults, while this series
>> conceptually complements it with the ability to populate guest memory
>> backed by guest_memfd for `KVM_X86_SW_PROTECTED_VM` VMs.
>
> +David
> +Sean
> +mm
Hi!
>
> While measuring memory population performance of guest_memfd using this
> series, I noticed that guest_memfd population takes longer than my
> baseline, which is filling anonymous private memory via UFFDIO_COPY.
>
> I am using x86_64 for my measurements and a 3 GiB memory region:
> - anon/private UFFDIO_COPY: 940 ms
> - guest_memfd: 1371 ms (+46%)
>
> It turns out that the effect is observable not only for guest_memfd, but
> also for any type of shared memory, eg memfd or anonymous memory mapped
> as shared.
> Below are measurements of a plain mmap(MAP_POPULATE) operation:
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_PRIVATE |
> MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
> vs
> mmap(NULL, 3ll * (1 << 30), PROT_READ | PROT_WRITE, MAP_SHARED |
> MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
>
> Results:
> - MAP_PRIVATE: 968 ms
> - MAP_SHARED: 1646 ms
At least here it is expected to some degree: as soon as the page cache
is involved, map/unmap gets slower, because we are effectively
maintaining two data structures (page tables + page cache) instead of
only a single one (page tables).
Can you make sure that THP/large folios don't interfere in your
experiments (e.g., madvise(MADV_NOHUGEPAGE))?
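I.e., something like the following (just a sketch; len is the region size,
MADV_POPULATE_WRITE needs >= 5.14, otherwise touch the pages manually after
the madvise):

/* Exclude THP first, then populate, so MAP_POPULATE doesn't fault
 * everything in before the MADV_NOHUGEPAGE hint takes effect. */
void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
if (p == MAP_FAILED)
	err(1, "mmap");
if (madvise(p, len, MADV_NOHUGEPAGE))
	err(1, "madvise(MADV_NOHUGEPAGE)");
if (madvise(p, len, MADV_POPULATE_WRITE))
	err(1, "madvise(MADV_POPULATE_WRITE)");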
>
> I am seeing this effect on a range of kernels. The oldest I used was
> 5.10, the newest is the current kvm-next (for-linus-2590-gd96c77bd4eeb).
>
> When profiling with perf, I observe the following hottest operations
> (kvm-next). Attaching full distributions at the end of the email.
>
> MAP_PRIVATE:
> - 19.72% clear_page_erms, rep stos %al,%es:(%rdi)
>
> MAP_SHARED:
> - 43.94% shmem_get_folio_gfp, lock orb $0x8,(%rdi), which is atomic
> setting of the PG_uptodate bit
> - 10.98% clear_page_erms, rep stos %al,%es:(%rdi)
Interesting.
>
> Note that MAP_PRIVATE/do_anonymous_page calls __folio_mark_uptodate, which
> sets the PG_uptodate bit non-atomically, while
> MAP_SHARED/shmem_get_folio_gfp calls folio_mark_uptodate, which sets the
> PG_uptodate bit atomically.
>
> While this logic is intuitive, its performance effect is more
> significant than I would expect.
Yes. How much of the performance difference would remain if you hack out
the atomic op just to play with it? I suspect there will still be some
difference.
>
> The questions are:
> - Is this a well-known behaviour?
> - Is there a way to mitigate that, ie make shared memory (including
> guest_memfd) population faster/comparable to private memory?
Likely. But your experiment above measures something different from what
guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
would assume guest_memfd will be faster than MAP_POPULATE.
How do you end up allocating memory for guest_memfd? Using simple
fallocate()?
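(With "simple fallocate()" I mean roughly the following on the guest_memfd
fd; a sketch assuming the current KVM_CREATE_GUEST_MEMFD interface, not
whatever your series actually does:)

#define _GNU_SOURCE
#include <err.h>
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>

/* Assumes vm_fd is an existing KVM VM file descriptor. */
static int alloc_guest_memfd(int vm_fd, __u64 size)
{
	struct kvm_create_guest_memfd gmem = { .size = size };
	int fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);

	if (fd < 0)
		err(1, "KVM_CREATE_GUEST_MEMFD");
	/* Plain fallocate() allocates the pages, no content initialisation. */
	if (fallocate(fd, 0, 0, size))
		err(1, "fallocate");
	return fd;
}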
Note that we might improve allocation times with guest_memfd when
allocating larger folios.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 13:46 ` David Hildenbrand
@ 2024-11-20 15:13 ` David Hildenbrand
2024-11-20 15:58 ` Nikita Kalyazin
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 15:13 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
>>
>> The questions are:
>> - Is this a well-known behaviour?
>> - Is there a way to mitigate that, ie make shared memory (including
>> guest_memfd) population faster/comparable to private memory?
>
> Likely. But your experiment above measures something different from what
> guest_memfd vs. anon does: guest_memfd doesn't update page tables, so I
> would assume guest_memfd will be faster than MAP_POPULATE.
>
> How do you end up allocating memory for guest_memfd? Using simple
> fallocate()?
Heh, now I spot that your comment was a reply to a series.
If your ioctl is supposed to do more than "allocating memory" like
MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
Because for allocating memory, we would want to use fallocate() instead.
I assume you want to "allocate+copy"?
I'll note that, as we're heading in the direction of moving
guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
ioctls, and think about something generic.
Any clue how your new ioctl will interact with the WIP to have shared
memory as part of guest_memfd? For example, could it be reasonable to
"populate" the shared memory first (via VMA) and then convert that
"allocated+filled" memory to private?
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 15:13 ` David Hildenbrand
@ 2024-11-20 15:58 ` Nikita Kalyazin
2024-11-20 16:20 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 15:58 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20/11/2024 15:13, David Hildenbrand wrote:
> Hi!
Hi! :)
>> Results:
>> - MAP_PRIVATE: 968 ms
>> - MAP_SHARED: 1646 ms
>
> At least here it is expected to some degree: as soon as the page cache
> is involved, map/unmap gets slower, because we are effectively
> maintaining two data structures (page tables + page cache) instead of
> only a single one (page tables).
>
> Can you make sure that THP/large folios don't interfere in your
> experiments (e.g., madvise(MADV_NOHUGEPAGE))?
I was using transparent_hugepage=never command line argument in my testing.
$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
Is that sufficient to exclude the THP/large folio factor?
>> While this logic is intuitive, its performance effect is more
>> significant than I would expect.
>
> Yes. How much of the performance difference would remain if you hack out
> the atomic op just to play with it? I suspect there will still be some
> difference.
I have tried that, but could not see any noticeable difference in the
overall results.
It looks like a big portion of the bottleneck has moved from
shmem_get_folio_gfp/folio_mark_uptodate to
finish_fault/__pte_offset_map_lock somehow. I have no good explanation
for why:
Orig:
- 69.62% do_fault
+ 44.61% __do_fault
+ 20.26% filemap_map_pages
+ 3.48% finish_fault
Hacked:
- 67.39% do_fault
+ 32.45% __do_fault
+ 21.87% filemap_map_pages
+ 11.97% finish_fault
Orig:
- 3.48% finish_fault
- 1.28% set_pte_range
0.96% folio_add_file_rmap_ptes
- 0.91% __pte_offset_map_lock
0.54% _raw_spin_lock
Hacked:
- 11.97% finish_fault
- 8.59% __pte_offset_map_lock
- 6.27% _raw_spin_lock
preempt_count_add
1.00% __pte_offset_map
- 1.28% set_pte_range
- folio_add_file_rmap_ptes
__mod_node_page_state
> Note that we might improve allocation times with guest_memfd when
> allocating larger folios.
I suppose it may not always be an option depending on the requirements for
consistency of the allocation latency. E.g. if a large folio isn't
available at the time, the performance would degrade to the base case
(please correct me if I'm missing something).
> Heh, now I spot that your comment was a reply to a series.
Yeah, sorry if it wasn't obvious.
>> If your ioctl is supposed to do more than "allocating memory" like
> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
> Because for allocating memory, we would want to use fallocate() instead.
> I assume you want to "allocate+copy"?
Yes, the ultimate use case is "allocate+copy".
>> I'll note that, as we're heading in the direction of moving
> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
> ioctls, and think about something generic.
Good point, thanks. Are we at the stage where some concrete API has
been proposed yet? I might have missed that.
> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?
No, I can't immediately see why it shouldn't work. My main concern
would probably still be about the latency of the population stage as I
can't see why it would improve compared to what we have now, because my
feeling is this is linked with the sharedness property of guest_memfd.
> Cheers,
>
> David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 15:58 ` Nikita Kalyazin
@ 2024-11-20 16:20 ` David Hildenbrand
2024-11-20 16:44 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:20 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20.11.24 16:58, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 15:13, David Hildenbrand wrote:
> > Hi!
>
> Hi! :)
>
> >> Results:
> >> - MAP_PRIVATE: 968 ms
> >> - MAP_SHARED: 1646 ms
> >
> > At least here it is expected to some degree: as soon as the page cache
> > is involved, map/unmap gets slower, because we are effectively
> > maintaining two data structures (page tables + page cache) instead of
> > only a single one (page tables).
> >
> > Can you make sure that THP/large folios don't interfere in your
> > experiments (e.g., madvise(MADV_NOHUGEPAGE))?
>
> I was using transparent_hugepage=never command line argument in my testing.
>
> $ cat /sys/kernel/mm/transparent_hugepage/enabled
> always madvise [never]
>
> Is that sufficient to exclude the THP/large folio factor?
Yes!
>
> >> While this logic is intuitive, its performance effect is more
> >> significant than I would expect.
> >
> > Yes. How much of the performance difference would remain if you hack out
> > the atomic op just to play with it? I suspect there will still be some
> > difference.
>
> I have tried that, but could not see any noticeable difference in the
> overall results.
>
> It looks like a big portion of the bottleneck has moved from
> shmem_get_folio_gfp/folio_mark_uptodate to
> finish_fault/__pte_offset_map_lock somehow. I have no good explanation
> for why:
That's what I assumed. The profiling results can be rather fuzzy and
misleading with micro-benchmarks. :(
>
> Orig:
> - 69.62% do_fault
> + 44.61% __do_fault
> + 20.26% filemap_map_pages
> + 3.48% finish_fault
> Hacked:
> - 67.39% do_fault
> + 32.45% __do_fault
> + 21.87% filemap_map_pages
> + 11.97% finish_fault
>
> Orig:
> - 3.48% finish_fault
> - 1.28% set_pte_range
> 0.96% folio_add_file_rmap_ptes
> - 0.91% __pte_offset_map_lock
> 0.54% _raw_spin_lock
> Hacked:
> - 11.97% finish_fault
> - 8.59% __pte_offset_map_lock
> - 6.27% _raw_spin_lock
> preempt_count_add
> 1.00% __pte_offset_map
> - 1.28% set_pte_range
> - folio_add_file_rmap_ptes
> __mod_node_page_state
>
> > Note that we might improve allocation times with guest_memfd when
> > allocating larger folios.
>
> I suppose it may not always be an option depending on the requirements for
> consistency of the allocation latency. E.g. if a large folio isn't
> available at the time, the performance would degrade to the base case
> (please correct me if I'm missing something).
Yes, there are cons to that.
>
>> Heh, now I spot that your comment was a reply to a series.
>
> Yeah, sorry if it wasn't obvious.
>
>> If your ioctl is supposed to do more than "allocating memory" like
>> MAP_POPULATE/MADV_POPULATE+* ... then POPULATE is a suboptimal choice.
>> Because for allocating memory, we would want to use fallocate() instead.
>> I assume you want to "allocate+copy"?
>
> Yes, the ultimate use case is "allocate+copy".
>
>> I'll note that, as we're heading in the direction of moving
>> guest_memfd.c into mm/guestmem.c, we'll likely want to avoid "KVM_*"
>> ioctls, and think about something generic.
>
> Good point, thanks. Are we at the stage where some concrete API has
> been proposed yet? I might have missed that.
People are working on it, and we're figuring out some remaining details
(e.g., page_type to intercept folio_put() ). I assume we'll see a new
RFC soonish (famous last words), but it's not been proposed yet.
>
>> Any clue how your new ioctl will interact with the WIP to have shared
>> memory as part of guest_memfd? For example, could it be reasonable to
>> "populate" the shared memory first (via VMA) and then convert that
>> "allocated+filled" memory to private?
>
> No, I can't immediately see why it shouldn't work. My main concern
> would probably still be about the latency of the population stage as I
> can't see why it would improve compared to what we have now, because my
> feeling is this is linked with the sharedness property of guest_memfd.
If the problem is the "pagecache" overhead, then yes, it will be a
harder nut to crack. But maybe there are some low-hanging fruits to
optimize? Finding the main cause for the added overhead would be
interesting.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 16:20 ` David Hildenbrand
@ 2024-11-20 16:44 ` David Hildenbrand
2024-11-20 17:21 ` Nikita Kalyazin
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 16:44 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
>> No, I can't immediately see why it shouldn't work. My main concern
>> would probably still be about the latency of the population stage as I
>> can't see why it would improve compared to what we have now, because my
> > feeling is this is linked with the sharedness property of guest_memfd.
>
> If the problem is the "pagecache" overhead, then yes, it will be a
> harder nut to crack. But maybe there are some low-hanging fruits to
> optimize? Finding the main cause for the added overhead would be
> interesting.
Can you compare uffdio_copy() when using anonymous memory vs. shmem?
That's likely the best we could currently achieve with guest_memfd.
There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
if that is of any help; it SEGFAULTS for me right now with a (likely)
division by 0.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 16:44 ` David Hildenbrand
@ 2024-11-20 17:21 ` Nikita Kalyazin
2024-11-20 18:29 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-20 17:21 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20/11/2024 16:44, David Hildenbrand wrote:
>> If the problem is the "pagecache" overhead, then yes, it will be a
>> harder nut to crack. But maybe there are some low-hanging fruits to
>> optimize? Finding the main cause for the added overhead would be
>> interesting.
Agreed, knowing the exact root cause would be really nice.
> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
> That's likely the best we could currently achieve with guest_memfd.
Yeah, I was doing that too. It was about ~28% slower in my setup, while
with guest_memfd it was ~34% slower. The variance of the data was quite
high so the difference may well be just noise. In other words, I'd be
much happier if we could bring guest_memfd (or even shmem) performance
closer to the anon/private than if we just equalised guest_memfd with
shmem (which are probably already pretty close).
> There is the tools/testing/selftests/mm/uffd-stress benchmark, not sure
> if that is of any help; it SEGFAULTS for me right now with a (likely)
> division by 0.
Thanks for the pointer, will take a look!
> Cheers,
>
> David / dhildenb
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 17:21 ` Nikita Kalyazin
@ 2024-11-20 18:29 ` David Hildenbrand
2024-11-21 16:46 ` Nikita Kalyazin
0 siblings, 1 reply; 11+ messages in thread
From: David Hildenbrand @ 2024-11-20 18:29 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20.11.24 18:21, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 16:44, David Hildenbrand wrote:
>>> If the problem is the "pagecache" overhead, then yes, it will be a
>>> harder nut to crack. But maybe there are some low-hanging fruits to
>>> optimize? Finding the main cause for the added overhead would be
>>> interesting.
>
> Agreed, knowing the exact root cause would be really nice.
>
>> Can you compare uffdio_copy() when using anonymous memory vs. shmem?
>> That's likely the best we could currently achieve with guest_memfd.
>
> Yeah, I was doing that too. It was about ~28% slower in my setup, while
> with guest_memfd it was ~34% slower.
I looked into uffdio_copy() for shmem and we still walk+modify page
tables. In theory, we could try hacking that out: for filling the
pagecache we would only need the vma properties, not the page table
properties; that would then really resemble "only modify the pagecache".
That would likely resemble what we would expect with guest_memfd: work
only on the pagecache and not the page tables. So it's rather surprising
that guest_memfd is slower than that, as it currently doesn't mess with
user page tables at all.
> The variance of the data was quite
> high so the difference may well be just noise. In other words, I'd be
> much happier if we could bring guest_memfd (or even shmem) performance
> closer to the anon/private than if we just equalised guest_memfd with
> shmem (which are probably already pretty close).
Makes sense. Best we can do is:
anon: work only on page tables
shmem/guest_memfd: work only on pagecache
So at least "only one treelike structure to update".
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-20 18:29 ` David Hildenbrand
@ 2024-11-21 16:46 ` Nikita Kalyazin
2024-11-26 16:04 ` Nikita Kalyazin
0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-21 16:46 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 20/11/2024 18:29, David Hildenbrand wrote:
> Any clue how your new ioctl will interact with the WIP to have shared
> memory as part of guest_memfd? For example, could it be reasonable to
> "populate" the shared memory first (via VMA) and then convert that
> "allocated+filled" memory to private?
Patrick and I synced internally on this. What may actually work for
guest_memfd population is the following.
Non-CoCo use case:
- fallocate syscall to fill the page cache, no page content
initialisation (like it is now)
- pwrite syscall to initialise the content + mark up-to-date (mark
prepared), no specific preparation logic is required
The pwrite will have "once" semantics until a subsequent
fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
page is already prepared and return EIO/ENOSPC or something.
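From userspace, the flow would look roughly like this (a sketch of the
proposed semantics; pwrite() support on guest_memfd does not exist upstream
yet, and gmem_fd/payload/size are placeholders):

/* gmem_fd: guest_memfd fd, payload: buffer with the initial contents. */

/* Optionally allocate the page cache up front, contents uninitialised. */
if (fallocate(gmem_fd, 0, 0, size))
	err(1, "fallocate");

/* Initialise the contents and mark the folios up-to-date ("prepare"). */
if (pwrite(gmem_fd, payload, size, 0) < 0)
	err(1, "pwrite");

/* A second pwrite() to the same range would fail (EIO/ENOSPC or similar)
 * until fallocate(FALLOC_FL_PUNCH_HOLE) resets it. */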
SEV-SNP use case (no changes):
- fallocate as above
- KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare
We don't think fallocate/pwrite have dependencies on current->mm
assumptions that Paolo mentioned in [1], so they should be safe to be
called on guest_memfd from a non-VMM process.
[1]:
https://lore.kernel.org/kvm/20241024095429.54052-1-kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c
> Makes sense. Best we can do is:
>
> anon: work only on page tables
> shmem/guest_memfd: work only on pagecache
>
> So at least "only one treelike structure to update".
This seems to hold with the above reasoning.
> --
> Cheers,
>
> David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-21 16:46 ` Nikita Kalyazin
@ 2024-11-26 16:04 ` Nikita Kalyazin
2024-11-28 12:11 ` David Hildenbrand
0 siblings, 1 reply; 11+ messages in thread
From: Nikita Kalyazin @ 2024-11-26 16:04 UTC (permalink / raw)
To: David Hildenbrand, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 21/11/2024 16:46, Nikita Kalyazin wrote:
>
>
> On 20/11/2024 18:29, David Hildenbrand wrote:
> > Any clue how your new ioctl will interact with the WIP to have shared
> > memory as part of guest_memfd? For example, could it be reasonable to
> > "populate" the shared memory first (via VMA) and then convert that
> > "allocated+filled" memory to private?
>
> Patrick and I synced internally on this. What may actually work for
> guest_memfd population is the following.
>
> Non-CoCo use case:
> - fallocate syscall to fill the page cache, no page content
> initialisation (like it is now)
> - pwrite syscall to initialise the content + mark up-to-date (mark
> prepared), no specific preparation logic is required
>
> The pwrite will have "once" semantics until a subsequent
> fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
> page is already prepared and return EIO/ENOSPC or something.
I prototyped that to see if it was possible (and it was). Actually the
write syscall can also do the allocation part, so no prior fallocate
would be required. The only thing is there is a cap on how much IO can
be done in a single call (MAX_RW_COUNT) [1], but it doesn't look like a
significant problem. Does it sound like an acceptable solution?
[1]: https://elixir.bootlin.com/linux/v6.12.1/source/fs/read_write.c#L507
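Handling that cap from userspace is straightforward, e.g. (sketch):

#include <sys/types.h>
#include <unistd.h>

/* Loop around pwrite() so ranges larger than MAX_RW_COUNT still work. */
static int pwrite_all(int fd, const char *buf, size_t len, off_t off)
{
	while (len) {
		ssize_t n = pwrite(fd, buf, len, off);

		if (n <= 0)
			return -1;	/* error or unexpected zero-length write */
		buf += n;
		off += n;
		len -= n;
	}
	return 0;
}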
>
> SEV-SNP use case (no changes):
> - fallocate as above
> - KVM_SEV_SNP_LAUNCH_UPDATE to initialise/prepare
>
> We don't think fallocate/pwrite have dependencies on current->mm
> assumptions that Paolo mentioned in [1], so they should be safe to be
> called on guest_memfd from a non-VMM process.
>
> [1]: https://lore.kernel.org/kvm/20241024095429.54052-1-
> kalyazin@amazon.com/T/#m57498f8e2fde577ad1da948ec74dd2225cd2056c
>
> > Makes sense. Best we can do is:
> >
> > anon: work only on page tables
> > shmem/guest_memfd: work only on pagecache
> >
> > So at least "only one treelike structure to update".
>
> This seems to hold with the above reasoning.
>
> > --
>> Cheers,
>>
>> David / dhildenb
>
^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd
2024-11-26 16:04 ` Nikita Kalyazin
@ 2024-11-28 12:11 ` David Hildenbrand
0 siblings, 0 replies; 11+ messages in thread
From: David Hildenbrand @ 2024-11-28 12:11 UTC (permalink / raw)
To: kalyazin, pbonzini, corbet, kvm, linux-doc, linux-kernel
Cc: jthoughton, brijesh.singh, michael.roth, graf, jgowans, roypat,
derekmn, nsaenz, xmarcalx, Sean Christopherson, linux-mm
On 26.11.24 17:04, Nikita Kalyazin wrote:
>
>
> On 21/11/2024 16:46, Nikita Kalyazin wrote:
>>
>>
>> On 20/11/2024 18:29, David Hildenbrand wrote:
>> > Any clue how your new ioctl will interact with the WIP to have shared
>> > memory as part of guest_memfd? For example, could it be reasonable to
>> > "populate" the shared memory first (via VMA) and then convert that
>> > "allocated+filled" memory to private?
>>
>> Patrick and I synced internally on this. What may actually work for
>> guest_memfd population is the following.
>>
>> Non-CoCo use case:
>> - fallocate syscall to fill the page cache, no page content
>> initialisation (like it is now)
>> - pwrite syscall to initialise the content + mark up-to-date (mark
>> prepared), no specific preparation logic is required
>>
>> The pwrite will have "once" semantics until a subsequent
>> fallocate(FALLOC_FL_PUNCH_HOLE), ie the next pwrite call will "see" the
>> page is already prepared and return EIO/ENOSPC or something.
>
> I prototyped that to see if it was possible (and it was). Actually the
> write syscall can also do the allocation part, so no prior fallocate
> would be required.
Right
> The only thing is there is a cap on how much IO can
> be done in a single call (MAX_RW_COUNT) [1], but it doesn't look like a
> significant problem. Does it sound like an acceptable solution?
Does sound quite clean to me. Of course, one thing to figure out is how
to enable this only for that special VM type, but that should be
possible to resolve.
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2024-11-28 12:11 UTC | newest]
Thread overview: 11+ messages
[not found] <20241024095429.54052-1-kalyazin@amazon.com>
2024-11-20 12:09 ` [RFC PATCH 0/4] KVM: ioctl for populating guest_memfd Nikita Kalyazin
2024-11-20 13:46 ` David Hildenbrand
2024-11-20 15:13 ` David Hildenbrand
2024-11-20 15:58 ` Nikita Kalyazin
2024-11-20 16:20 ` David Hildenbrand
2024-11-20 16:44 ` David Hildenbrand
2024-11-20 17:21 ` Nikita Kalyazin
2024-11-20 18:29 ` David Hildenbrand
2024-11-21 16:46 ` Nikita Kalyazin
2024-11-26 16:04 ` Nikita Kalyazin
2024-11-28 12:11 ` David Hildenbrand