From: Stefan Roesch <shr@devkernel.io>
To: David Hildenbrand <david@redhat.com>
Cc: kernel-team@fb.com, linux-mm@kvack.org, riel@surriel.com,
mhocko@suse.com, linux-kselftest@vger.kernel.org,
linux-doc@vger.kernel.org, akpm@linux-foundation.org,
hannes@cmpxchg.org, Bagas Sanjaya <bagasdotme@gmail.com>,
Janosch Frank <frankja@linux.ibm.com>,
Christian Borntraeger <borntraeger@de.ibm.com>
Subject: Re: [PATCH v4 1/3] mm: add new api to enable ksm per process
Date: Tue, 04 Apr 2023 09:32:31 -0700
Message-ID: <qvqw4jpviov1.fsf@dev0134.prn3.facebook.com>
In-Reply-To: <e888871b-9f48-c01d-ce7f-f32ec3d79ef8@redhat.com>
David Hildenbrand <david@redhat.com> writes:
> On 03.04.23 12:37, David Hildenbrand wrote:
>> On 10.03.23 19:28, Stefan Roesch wrote:
>>> Patch series "mm: process/cgroup ksm support", v3.
>>>
>>> So far KSM can only be enabled by calling madvise for memory regions. To
>>> be able to use KSM for more workloads, KSM needs to have the ability to be
>>> enabled / disabled at the process / cgroup level.
>>>
>>> Use case 1:
>>>
>>> The madvise call is not available in the programming language. An
>>> example for this are programs with forked workloads using a garbage
>>> collected language without pointers. In such a language madvise cannot
>>> be made available.
>>>
>>> In addition the addresses of objects get moved around as they are
>>> garbage collected. KSM sharing needs to be enabled "from the outside"
>>> for these types of workloads.
>> I guess the interpreter could enable it (like a memory allocator could
>> enable it for the whole heap). But I get that it's much easier to enable
>> this per-process, and eventually only when a lot of the same processes
>> are running in that particular environment.
>>
>>>
>>> Use case 2:
>>>
>>> The same interpreter can also be used for workloads where KSM brings
>>> no benefit or even has overhead. We'd like to be able to enable KSM on
>>> a workload by workload basis.
>> Agreed. A per-process control is also helpful to identify workloads
>> where KSM might be beneficial (and to which degree).
>>
>>>
>>> Use case 3:
>>>
>>> With the madvise call sharing opportunities are only enabled for the
>>> current process: it is a workload-local decision. A considerable number
>>> of sharing opportunities may exist across multiple workloads or jobs.
>>> Only a higher level entity like a job scheduler or container can know
>>> for certain if it's running one or more instances of a job. That job
>>> scheduler however doesn't have the necessary internal workload knowledge
>>> to make targeted madvise calls.
>>>
>>> Security concerns:
>>>
>>> In previous discussions security concerns have been brought up. The
>>> problem is that an individual workload does not have the knowledge about
>>> what else is running on a machine. Therefore it has to be very
>>> conservative in what memory areas can be shared or not. However, if the
>>> system is dedicated to running multiple jobs within the same security
>>> domain, it's the job scheduler that has the knowledge that sharing can be
>>> safely enabled and is even desirable.
>>>
>>> Performance:
>>>
>>> Experiments with using UKSM have shown a capacity increase of around
>>> 20%.
>>>
>> As raised, it would be great to include more details about the workload
>> where this particularly helps (e.g., a lot of Django processes operating
>> in the same domain).
>>
>>>
>>> 1. New options for prctl system call
>>>
>>> This patch series adds two new options to the prctl system call.
>>> The first one enables KSM at the process level and the second
>>> one queries the setting.
>>>
>>> The setting will be inherited by child processes.
>>>
>>> With the above setting, KSM can be enabled for the seed process of a
>>> cgroup and all processes in the cgroup will inherit the setting.
>>>
>>> 2. Changes to KSM processing
>>>
>>> When KSM is enabled at the process level, the KSM code will iterate
>>> over all the VMA's and enable KSM for the eligible VMA's.
>>>
>>> When forking a process that has KSM enabled, the setting will be
>>> inherited by the new child process.
>>>
>>> In addition when KSM is disabled for a process, KSM will be disabled
>>> for the VMA's where KSM has been enabled.
>> Do we want to make MADV_MERGEABLE/MADV_UNMERGEABLE fail while the new
>> prctl is enabled for a process?
>>
>>>
>>> 3. Add general_profit metric
>>>
>>> The general_profit metric of KSM is specified in the documentation,
>>> but not calculated. This adds the general profit metric to
>>> /sys/kernel/debug/mm/ksm.
>>>
>>> 4. Add more metrics to ksm_stat
>>>
>>> This adds the process profit and ksm type metric to
>>> /proc/<pid>/ksm_stat.
>>>
>>> 5. Add more tests to ksm_tests
>>>
>>> This adds an option to specify the merge type to the ksm_tests.
>>> This allows testing both madvise and prctl KSM. It also adds a new option
>>> to query if prctl KSM has been enabled. It adds a fork test to verify
>>> that the KSM process setting is inherited by child processes.
>>>
>>> An update to the prctl(2) manpage has been proposed at [1].
>>>
>>> This patch (of 3):
>>>
>>> This adds a new prctl API to enable and disable KSM on a per-process
>>> basis instead of only on a per-VMA basis (with madvise).
>>>
>>> 1) Introduce new MMF_VM_MERGE_ANY flag
>>>
>>> This introduces the new MMF_VM_MERGE_ANY flag. When this flag
>>> is set, kernel samepage merging (ksm) gets enabled for all vmas of a
>>> process.
>>>
>>> 2) add flag to __ksm_enter
>>>
>>> This change adds the flag parameter to __ksm_enter. This makes it
>>> possible to distinguish whether ksm was entered via prctl or madvise.
>>>
>>> 3) add flag to __ksm_exit call
>>>
>>> This adds the flag parameter to the __ksm_exit() call. This makes it
>>> possible to distinguish whether the call is for a prctl or madvise invocation.
>>>
>>> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>>>
>>> If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
>>> over all the vmas and enable ksm if possible. For the vmas that can be
>>> ksm enabled this is only done once.
>>>
>>> 5) support disabling of ksm for a process
>>>
>>> This adds the ability to disable ksm for a process if ksm has been
>>> enabled for the process.
>>>
>>> 6) add new prctl option to get and set ksm for a process
>>>
>>> This adds two new options to the prctl system call
>>> - enable ksm for all vmas of a process (if the vmas support it).
>>> - query if ksm has been enabled for a process.
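For readers who want to see what calling such a pair of prctl options looks like from user space, here is a minimal sketch. The quoted text does not name the options, so the constants below are an assumption: they use the names and numbers found in mainline Linux headers from 6.4 onwards (PR_SET_MEMORY_MERGE and PR_GET_MEMORY_MERGE), which may differ from what this revision of the series used. The wrapper names ksm_process_set() and ksm_process_query() are hypothetical helpers, not part of any API.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/*
 * Assumption: option names and values as in mainline Linux (6.4+).
 * Older headers do not define them, so provide fallbacks.
 */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Hypothetical helper: enable (1) or disable (0) KSM for the calling
 * process. Returns 0 on success or a negative errno value, for example
 * -EINVAL on kernels that do not know the option.
 */
static int ksm_process_set(int enable)
{
    assert(enable == 0 || enable == 1);
    if (prctl(PR_SET_MEMORY_MERGE, (unsigned long)enable, 0UL, 0UL, 0UL) == -1)
        return -errno;
    return 0;
}

/*
 * Hypothetical helper: query the per-process setting.
 * Returns 1 if enabled, 0 if not, or a negative errno value on failure.
 */
static int ksm_process_query(void)
{
    int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

    return ret == -1 ? -errno : ret;
}
```

A launcher could call ksm_process_set(1) before exec'ing the workload, relying on the inheritance described in the cover letter, and treat a negative return as "not supported here" rather than as a fatal error.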
>> Did you consider, instead of handling MMF_VM_MERGE_ANY in a special way,
>> making it reuse the existing MMF_VM_MERGEABLE/VM_MERGEABLE
>> infrastructure? Especially:
>> 1) During prctl(MMF_VM_MERGE_ANY), set VM_MERGEABLE on all applicable
>> compatible VMAs. Further, set MMF_VM_MERGEABLE and enter KSM if not
>> already set.
>> 2) When creating a new, compatible VMA and MMF_VM_MERGE_ANY is set, set
>> VM_MERGEABLE?
>> Then you can avoid all runtime checks for compatible VMAs and only look
>> at the VM_MERGEABLE flag. In fact, VM_MERGEABLE will be completely
>> expressive then for all VMAs. You don't need vma_ksm_mergeable() then.
>> Another thing to consider is interaction with arch/s390/mm/gmap.c:
>> s390x/kvm does not support KSM and it has to disable it for all VMAs. We
>> have to find a way to fence the prctl (for example, fail setting the
>> prctl after gmap_mark_unmergeable() ran, and make
>> gmap_mark_unmergeable() fail if the prctl ran -- or handle it gracefully
>> in some other way).
gmap_mark_unmergeable() seems to have a problem today. We can execute
gmap_mark_unmergeable() and mark the VMAs as unmergeable, but shortly
after that the process can run madvise on them again and make them
mergeable. Am I missing something here?

Once prctl is run, we can check for the MMF_VM_MERGE_ANY flag in
gmap_mark_unmergeable(). In case it is set, we can return an error. The
error code path looks like it can handle that case.

For the opposite case, where gmap_mark_unmergeable() has already been run,
we would need some kind of flag or other means to be able to detect it.
Any recommendations?
>
> Staring at that code, I wonder if the "mm->def_flags &= ~VM_MERGEABLE" is doing
> what it's supposed to do. I don't think this currently prevents madvise() from
> re-enabling KSM on that VMA.
>
> @Christian, Janosch, am I missing something?
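
The concern raised by both Stefan and David can be illustrated with a small user-space model. It is a sketch under the assumption the thread itself is questioning: that def_flags only supplies defaults for newly created VMAs and that the madvise path sets the flag on a VMA without consulting def_flags. All model_* names are hypothetical, and the flag value is only a placeholder.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder value; only the bit's identity matters in this model. */
#define MODEL_VM_MERGEABLE 0x80000000UL

struct model_vma {
    unsigned long vm_flags;
};

struct model_mm {
    unsigned long def_flags;   /* defaults applied to newly created VMAs */
    struct model_vma *vmas;
    size_t nr_vmas;
};

/*
 * Model of what gmap_mark_unmergeable() is described as doing: clear the
 * flag in def_flags and on every existing VMA.
 */
static void model_mark_unmergeable(struct model_mm *mm)
{
    assert(mm != NULL);
    mm->def_flags &= ~MODEL_VM_MERGEABLE;
    for (size_t i = 0; i < mm->nr_vmas; i++)
        mm->vmas[i].vm_flags &= ~MODEL_VM_MERGEABLE;
}

/*
 * Model of madvise(MADV_MERGEABLE) on one VMA. Assumption: nothing on this
 * path looks at def_flags, so the flag is simply set again.
 */
static void model_madvise_mergeable(struct model_vma *vma)
{
    assert(vma != NULL);
    vma->vm_flags |= MODEL_VM_MERGEABLE;
}

/* Returns 1 if at least one VMA is currently marked mergeable, else 0. */
static int model_any_mergeable(const struct model_mm *mm)
{
    for (size_t i = 0; i < mm->nr_vmas; i++)
        if (mm->vmas[i].vm_flags & MODEL_VM_MERGEABLE)
            return 1;
    return 0;
}
```

In this model, calling model_mark_unmergeable() followed by model_madvise_mergeable() on any VMA leaves that VMA mergeable again while def_flags stays cleared, which is the gap described above.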