Re: [PATCH v4 1/3] mm: add new api to enable ksm per process


From: David Hildenbrand <david@redhat.com>
To: Christian Borntraeger <borntraeger@de.ibm.com>,
	Stefan Roesch <shr@devkernel.io>,
	kernel-team@fb.com
Cc: linux-mm@kvack.org, riel@surriel.com, mhocko@suse.com,
	linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
	akpm@linux-foundation.org, hannes@cmpxchg.org,
	Bagas Sanjaya <bagasdotme@gmail.com>,
	Janosch Frank <frankja@linux.ibm.com>
Subject: Re: [PATCH v4 1/3] mm: add new api to enable ksm per process
Date: Wed, 5 Apr 2023 18:04:29 +0200
Message-ID: <d3af0fb8-64c4-fc68-0e8c-9bdab8706e02@redhat.com>
In-Reply-To: <2229abe0-b304-6ae3-5bda-d71387c645ca@de.ibm.com>

On 05.04.23 08:51, Christian Borntraeger wrote:
> On 03.04.23 at 13:03, David Hildenbrand wrote:
>> On 03.04.23 12:37, David Hildenbrand wrote:
>>> On 10.03.23 19:28, Stefan Roesch wrote:
>>>> Patch series "mm: process/cgroup ksm support", v3.
>>>>
>>>> So far KSM can only be enabled by calling madvise for memory regions.  To
>>>> be able to use KSM for more workloads, KSM needs to have the ability to be
>>>> enabled / disabled at the process / cgroup level.
>>>>
>>>> Use case 1:
>>>>
>>>>      The madvise call is not available in the programming language.  One
>>>>      example of this is programs with forked workloads using a garbage
>>>>      collected language without pointers.  In such a language madvise cannot
>>>>      be made available.
>>>>
>>>>      In addition the addresses of objects get moved around as they are
>>>>      garbage collected.  KSM sharing needs to be enabled "from the outside"
>>>>      for these types of workloads.
>>>
>>> I guess the interpreter could enable it (like a memory allocator could
>>> enable it for the whole heap). But I get that it's much easier to enable
>>> this per-process, and possibly only when a lot of the same processes
>>> are running in that particular environment.
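
For reference, the per-region opt-in that an allocator or interpreter
runtime would have to perform today looks roughly like the sketch below.
The helper names ksm_enable_region() and ksm_alloc_mergeable() are made up
for illustration; only mmap() and madvise(MADV_MERGEABLE) are existing
interfaces.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1	/* for madvise() and MADV_MERGEABLE in glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Opt one mapped region into KSM (hypothetical helper).
 * Returns 0 on success or a negative errno value, e.g. -EINVAL when the
 * kernel was built without CONFIG_KSM.
 */
static int ksm_enable_region(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	if (madvise(addr, len, MADV_MERGEABLE) != 0)
		return -errno;
	return 0;
}

/*
 * Map a fresh private anonymous region and try to mark it mergeable
 * (hypothetical helper). Returns NULL if mmap() fails; the madvise()
 * result is reported separately through *ksm_err.
 */
static void *ksm_alloc_mergeable(size_t len, int *ksm_err)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	*ksm_err = ksm_enable_region(p, len);
	return p;
}
```

A runtime would have to repeat this for every heap region it maps, which
is the burden a per-process switch takes away.
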
>>>
>>>>
>>>> Use case 2:
>>>>
>>>>      The same interpreter can also be used for workloads where KSM brings
>>>>      no benefit or even has overhead.  We'd like to be able to enable KSM on
>>>>      a workload by workload basis.
>>>
>>> Agreed. A per-process control is also helpful to identify workloads
>>> where KSM might be beneficial (and to which degree).
>>>
>>>>
>>>> Use case 3:
>>>>
>>>>      With the madvise call, sharing opportunities are only enabled for the
>>>>      current process: it is a workload-local decision.  A considerable number
>>>>      of sharing opportunities may exist across multiple workloads or jobs.
>>>>      Only a higher level entity like a job scheduler or container can know
>>>>      for certain if it's running one or more instances of a job.  That job
>>>>      scheduler however doesn't have the necessary internal workload knowledge
>>>>      to make targeted madvise calls.
>>>>
>>>> Security concerns:
>>>>
>>>>      In previous discussions security concerns have been brought up.  The
>>>>      problem is that an individual workload does not have the knowledge about
>>>>      what else is running on a machine.  Therefore it has to be very
>>>>      conservative in what memory areas can be shared or not.  However, if the
>>>>      system is dedicated to running multiple jobs within the same security
>>>>      domain, it's the job scheduler that has the knowledge that sharing can be
>>>>      safely enabled and is even desirable.
>>>>
>>>> Performance:
>>>>
>>>>      Experiments with using UKSM have shown a capacity increase of around
>>>>      20%.
>>>>
>>>
>>> As raised, it would be great to include more details about the workload
>>> where this particularly helps (e.g., a lot of Django processes operating
>>> in the same domain).
>>>
>>>>
>>>> 1. New options for prctl system call
>>>>
>>>>       This patch series adds two new options to the prctl system call.
>>>>       The first one allows enabling KSM at the process level and the second
>>>>       one allows querying the setting.
>>>>
>>>>       The setting will be inherited by child processes.
>>>>
>>>>       With the above setting, KSM can be enabled for the seed process of a
>>>>       cgroup and all processes in the cgroup will inherit the setting.
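
A minimal userspace sketch of the two options described above. Assumption:
the option names PR_SET_MEMORY_MERGE / PR_GET_MEMORY_MERGE and the values
67 / 68 are taken from current mainline <linux/prctl.h>; this revision of
the series may name or number them differently, and kernels without the
feature are expected to fail the call with EINVAL. The wrapper names are
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Assumed option numbers; only used if the installed headers lack them. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Enable (1) or disable (0) KSM for the calling process.
 * Returns 0 on success or a negative errno value.
 */
static int ksm_process_set(int enable)
{
	assert(enable == 0 || enable == 1);
	if (prctl(PR_SET_MEMORY_MERGE, (unsigned long)enable, 0UL, 0UL, 0UL) != 0)
		return -errno;
	return 0;
}

/*
 * Query the per-process setting.
 * Returns 1 if enabled, 0 if disabled, or a negative errno value.
 */
static int ksm_process_get(void)
{
	int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

	return ret < 0 ? -errno : ret;
}
```

A job launcher could call ksm_process_set(1) in the seed process before
forking its workers, relying on the inheritance described above.
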
>>>>
>>>> 2. Changes to KSM processing
>>>>
>>>>       When KSM is enabled at the process level, the KSM code will iterate
>>>>       over all the VMAs and enable KSM for the eligible VMAs.
>>>>
>>>>       When forking a process that has KSM enabled, the setting will be
>>>>       inherited by the new child process.
>>>>
>>>>       In addition when KSM is disabled for a process, KSM will be disabled
>>>>       for the VMAs where KSM has been enabled.
>>>
>>> Do we want to make MADV_MERGEABLE/MADV_UNMERGEABLE fail while the new
>>> prctl is enabled for a process?
>>>
>>>>
>>>> 3. Add general_profit metric
>>>>
>>>>       The general_profit metric of KSM is specified in the documentation,
>>>>       but not calculated.  This adds the general profit metric to
>>>>       /sys/kernel/debug/mm/ksm.
>>>>
>>>> 4. Add more metrics to ksm_stat
>>>>
>>>>       This adds the process profit and ksm type metric to
>>>>       /proc/<pid>/ksm_stat.
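
A small sketch of how a monitoring tool might pick a single value out of
that file. Assumptions: the file carries one "name value" pair per line,
and the field names used in the example (ksm_rmap_items,
ksm_process_profit) match what the kernel prints; neither is confirmed by
this cover letter. The function name is hypothetical, and it works on a
buffer so that reading the file is left to the caller.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Look for a line starting with "<name> " or "<name>:" in text and parse
 * the signed number that follows. Returns 0 and stores the value in *out
 * on success, -1 if the field is absent or has no numeric value.
 */
static int ksm_stat_lookup(const char *text, const char *name, long *out)
{
	assert(text != NULL && name != NULL && out != NULL);

	size_t nlen = strlen(name);

	for (const char *line = text; *line != '\0'; ) {
		if (strncmp(line, name, nlen) == 0 &&
		    (line[nlen] == ' ' || line[nlen] == ':') &&
		    sscanf(line + nlen + 1, "%ld", out) == 1)
			return 0;

		/* Advance to the start of the next line, if there is one. */
		const char *eol = strchr(line, '\n');
		if (eol == NULL)
			break;
		line = eol + 1;
	}
	return -1;
}
```

The value is parsed as signed because a per-process profit can be
negative when the rmap_item overhead outweighs the memory saved.
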
>>>>
>>>> 5. Add more tests to ksm_tests
>>>>
>>>>       This adds an option to specify the merge type to the ksm_tests.
>>>>       This allows testing madvise and prctl KSM.  It also adds a new option
>>>>       to query if prctl KSM has been enabled.  It adds a fork test to verify
>>>>       that the KSM process setting is inherited by child processes.
>>>>
>>>> An update to the prctl(2) manpage has been proposed at [1].
>>>>
>>>> This patch (of 3):
>>>>
>>>> This adds a new prctl API to enable and disable KSM on a per-process
>>>> basis instead of only on a per-VMA basis (with madvise).
>>>>
>>>> 1) Introduce new MMF_VM_MERGE_ANY flag
>>>>
>>>>       This introduces the new flag MMF_VM_MERGE_ANY.  When this flag
>>>>       is set, kernel samepage merging (ksm) gets enabled for all VMAs of a
>>>>       process.
>>>>
>>>> 2) add flag to __ksm_enter
>>>>
>>>>       This change adds the flag parameter to __ksm_enter.  This makes it
>>>>       possible to distinguish whether ksm was called by prctl or madvise.
>>>>
>>>> 3) add flag to __ksm_exit call
>>>>
>>>>       This adds the flag parameter to the __ksm_exit() call.  This makes it
>>>>       possible to distinguish whether this call is for a prctl or madvise
>>>>       invocation.
>>>>
>>>> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>>>>
>>>>       If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
>>>>       over all the vmas and enable ksm if possible.  For the vmas that can be
>>>>       ksm enabled this is only done once.
>>>>
>>>> 5) support disabling of ksm for a process
>>>>
>>>>       This adds the ability to disable ksm for a process if ksm has been
>>>>       enabled for the process.
>>>>
>>>> 6) add new prctl option to get and set ksm for a process
>>>>
>>>>       This adds two new options to the prctl system call
>>>>       - enable ksm for all vmas of a process (if the vmas support it).
>>>>       - query if ksm has been enabled for a process.
>>>
>>>
>>> Did you consider, instead of handling MMF_VM_MERGE_ANY in a special way,
>>> making it reuse the existing MMF_VM_MERGEABLE/VM_MERGEABLE
>>> infrastructure? In particular:
>>>
>>> 1) During prctl(MMF_VM_MERGE_ANY), set VM_MERGEABLE on all compatible
>>>       VMAs. Further, set MMF_VM_MERGEABLE and enter KSM if not
>>>       already set.
>>>
>>> 2) When creating a new, compatible VMA and MMF_VM_MERGE_ANY is set, set
>>>       VM_MERGEABLE?
>>>
>>> Then you can avoid all runtime checks for compatible VMAs and only look
>>> at the VM_MERGEABLE flag. In fact, VM_MERGEABLE will then be fully
>>> expressive for all VMAs. You don't need vma_ksm_mergeable() then.
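
To make the two steps concrete, here is a toy userspace model of the idea.
It is not kernel code: every type, flag value and function name below is
invented for illustration, and "compatible" is reduced to two stand-in
exclusion bits. It only shows that once the flag is propagated at prctl
time and at VMA creation time, the scanner side needs nothing but a test
of the per-VMA mergeable bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy flag bits; the values are illustrative, not the kernel's. */
#define TOY_VM_SHARED		(1UL << 0)
#define TOY_VM_SPECIAL		(1UL << 1)	/* stands in for all other exclusions */
#define TOY_VM_MERGEABLE	(1UL << 2)
#define TOY_MMF_VM_MERGEABLE	(1UL << 0)
#define TOY_MMF_VM_MERGE_ANY	(1UL << 1)

#define TOY_MAX_VMAS 8

struct toy_vma {
	unsigned long vm_flags;
};

struct toy_mm {
	unsigned long flags;
	struct toy_vma vmas[TOY_MAX_VMAS];
	size_t nr_vmas;
};

static bool toy_vma_compatible(const struct toy_vma *vma)
{
	return !(vma->vm_flags & (TOY_VM_SHARED | TOY_VM_SPECIAL));
}

/* Step 1: at prctl time, flag every compatible VMA and the mm itself. */
static void toy_prctl_merge_any(struct toy_mm *mm)
{
	for (size_t i = 0; i < mm->nr_vmas; i++)
		if (toy_vma_compatible(&mm->vmas[i]))
			mm->vmas[i].vm_flags |= TOY_VM_MERGEABLE;
	mm->flags |= TOY_MMF_VM_MERGE_ANY | TOY_MMF_VM_MERGEABLE;
}

/* Step 2: a new compatible VMA inherits the flag if the mm opted in. */
static struct toy_vma *toy_vma_create(struct toy_mm *mm, unsigned long vm_flags)
{
	assert(mm->nr_vmas < TOY_MAX_VMAS);

	struct toy_vma *vma = &mm->vmas[mm->nr_vmas++];

	vma->vm_flags = vm_flags;
	if ((mm->flags & TOY_MMF_VM_MERGE_ANY) && toy_vma_compatible(vma))
		vma->vm_flags |= TOY_VM_MERGEABLE;
	return vma;
}

/* The scanner only has to look at the per-VMA flag. */
static bool toy_scan_wants(const struct toy_vma *vma)
{
	return (vma->vm_flags & TOY_VM_MERGEABLE) != 0;
}
```

In this model a shared mapping never gets the mergeable bit, whether it
exists before the prctl or is created afterwards, so no compatibility
check is left on the scan path.
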
>>>
>>>
>>> Another thing to consider is interaction with arch/s390/mm/gmap.c:
>>> s390x/kvm does not support KSM and it has to disable it for all VMAs. We
> 
> Normally we do support KSM on s390. This is a special case for guests using
> storage keys. Those are attributes of the physical page and might differ even
> if the content of the page is the same.
> Recent Linux no longer uses them (unless a debug option is set during build), so we
> enable the guest storage keys lazily and break KSM pages in that process.
> Ideally we would keep these semantics (e.g. even after a prctl, if the
> guest enables storage keys, disable ksm for this VM).

IIRC, KSM also gets disabled when switching to protected VMs. I recall 
that we really wanted to stop KSM scanning pages that are possibly 
protected. (don't remember if one could harm the system enabling it 
before/after the switch)

> 
>>> have to find a way to fence the prctl (for example, fail setting the
>>> prctl after gmap_mark_unmergeable() ran, and make
>>> gmap_mark_unmergeable() fail if the prctl ran -- or handle it gracefully
>>> in some other way).
>>
>>
>> Staring at that code, I wonder if the "mm->def_flags &= ~VM_MERGEABLE" is doing what it's supposed to do. I don't think it currently prevents KSM from getting re-enabled on that VMA via madvise().
>>
>> @Christian, Janosch, am I missing something?
> 
> Yes, if QEMU did an madvise later on instead of just at the start, it would
> result in guest storage keys getting messed up on KSM merges. One could argue
> that this is a bug in the hypervisor then (QEMU) but yes, we should try
> to make this more reliable in the kernel.

It looks like the "mm->def_flags &= ~VM_MERGEABLE" wanted to achieve 
that, but failed. At least it looks like completely unnecessary code if 
I am not wrong.

Maybe inspired by similar code in thp_split_mm(), that enforces 
VM_NOHUGEPAGE.

-- 
Thanks,

David / dhildenb

