From: David Hildenbrand <david@redhat.com>
To: Stefan Roesch <shr@devkernel.io>, kernel-team@fb.com
Cc: linux-mm@kvack.org, riel@surriel.com, mhocko@suse.com,
linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org,
akpm@linux-foundation.org, hannes@cmpxchg.org,
Bagas Sanjaya <bagasdotme@gmail.com>
Subject: Re: [PATCH v4 1/3] mm: add new api to enable ksm per process
Date: Mon, 3 Apr 2023 12:37:56 +0200 [thread overview]
Message-ID: <7ed4308d-b400-d2bb-b539-3fe418862ab8@redhat.com> (raw)
In-Reply-To: <20230310182851.2579138-2-shr@devkernel.io>
On 10.03.23 19:28, Stefan Roesch wrote:
> Patch series "mm: process/cgroup ksm support", v3.
>
> So far KSM can only be enabled by calling madvise for memory regions. To
> be able to use KSM for more workloads, KSM needs to have the ability to be
> enabled / disabled at the process / cgroup level.
>
> Use case 1:
>
> The madvise call is not available in the programming language. An
> example for this are programs with forked workloads using a garbage
> collected language without pointers. In such a language madvise cannot
> be made available.
>
> In addition the addresses of objects get moved around as they are
> garbage collected. KSM sharing needs to be enabled "from the outside"
> for these type of workloads.
I guess the interpreter could enable it (like a memory allocator could
enable it for the whole heap). But I get that it's much easier to enable
this per-process, and eventually only when a lot of the same processes
are running in that particular environment.
>
> Use case 2:
>
> The same interpreter can also be used for workloads where KSM brings
> no benefit or even has overhead. We'd like to be able to enable KSM on
> a workload by workload basis.
Agreed. A per-process control is also helpful to identidy workloads
where KSM might be beneficial (and to which degree).
>
> Use case 3:
>
> With the madvise call sharing opportunities are only enabled for the
> current process: it is a workload-local decision. A considerable number
> of sharing opportuniites may exist across multiple workloads or jobs.
> Only a higler level entity like a job scheduler or container can know
> for certain if its running one or more instances of a job. That job
> scheduler however doesn't have the necessary internal worklaod knowledge
> to make targeted madvise calls.
>
> Security concerns:
>
> In previous discussions security concerns have been brought up. The
> problem is that an individual workload does not have the knowledge about
> what else is running on a machine. Therefore it has to be very
> conservative in what memory areas can be shared or not. However, if the
> system is dedicated to running multiple jobs within the same security
> domain, its the job scheduler that has the knowledge that sharing can be
> safely enabled and is even desirable.
>
> Performance:
>
> Experiments with using UKSM have shown a capacity increase of around
> 20%.
>
As raised, it would be great to include more details about the workload
where this particulalry helps (e.g., a lot of Django processes operating
in the same domain).
>
> 1. New options for prctl system command
>
> This patch series adds two new options to the prctl system call.
> The first one allows to enable KSM at the process level and the second
> one to query the setting.
>
> The setting will be inherited by child processes.
>
> With the above setting, KSM can be enabled for the seed process of a
> cgroup and all processes in the cgroup will inherit the setting.
>
> 2. Changes to KSM processing
>
> When KSM is enabled at the process level, the KSM code will iterate
> over all the VMA's and enable KSM for the eligible VMA's.
>
> When forking a process that has KSM enabled, the setting will be
> inherited by the new child process.
>
> In addition when KSM is disabled for a process, KSM will be disabled
> for the VMA's where KSM has been enabled.
Do we want to make MADV_MERGEABLE/MADV_UNMERGEABLE fail while the new
prctl is enabled for a process?
>
> 3. Add general_profit metric
>
> The general_profit metric of KSM is specified in the documentation,
> but not calculated. This adds the general profit metric to
> /sys/kernel/debug/mm/ksm.
>
> 4. Add more metrics to ksm_stat
>
> This adds the process profit and ksm type metric to
> /proc/<pid>/ksm_stat.
>
> 5. Add more tests to ksm_tests
>
> This adds an option to specify the merge type to the ksm_tests.
> This allows to test madvise and prctl KSM. It also adds a new option
> to query if prctl KSM has been enabled. It adds a fork test to verify
> that the KSM process setting is inherited by client processes.
>
> An update to the prctl(2) manpage has been proposed at [1].
>
> This patch (of 3):
>
> This adds a new prctl to API to enable and disable KSM on a per process
> basis instead of only at the VMA basis (with madvise).
>
> 1) Introduce new MMF_VM_MERGE_ANY flag
>
> This introduces the new flag MMF_VM_MERGE_ANY flag. When this flag
> is set, kernel samepage merging (ksm) gets enabled for all vma's of a
> process.
>
> 2) add flag to __ksm_enter
>
> This change adds the flag parameter to __ksm_enter. This allows to
> distinguish if ksm was called by prctl or madvise.
>
> 3) add flag to __ksm_exit call
>
> This adds the flag parameter to the __ksm_exit() call. This allows
> to distinguish if this call is for an prctl or madvise invocation.
>
> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>
> If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
> over all the vmas and enable ksm if possible. For the vmas that can be
> ksm enabled this is only done once.
>
> 5) support disabling of ksm for a process
>
> This adds the ability to disable ksm for a process if ksm has been
> enabled for the process.
>
> 6) add new prctl option to get and set ksm for a process
>
> This adds two new options to the prctl system call
> - enable ksm for all vmas of a process (if the vmas support it).
> - query if ksm has been enabled for a process.
Did you consider, instead of handling MMF_VM_MERGE_ANY in a special way,
to instead make it reuse the existing MMF_VM_MERGEABLE/VM_MERGEABLE
infrastructure. Especially:
1) During prctl(MMF_VM_MERGE_ANY), set VM_MERGABLE on all applicable
compatible. Further, set MMF_VM_MERGEABLE and enter KSM if not
already set.
2) When creating a new, compatible VMA and MMF_VM_MERGE_ANY is set, set
VM_MERGABLE?
The you can avoid all runtime checks for compatible VMAs and only look
at the VM_MERGEABLE flag. In fact, the VM_MERGEABLE will be completely
expressive then for all VMAs. You don't need vma_ksm_mergeable() then.
Another thing to consider is interaction with arch/s390/mm/gmap.c:
s390x/kvm does not support KSM and it has to disable it for all VMAs. We
have to find a way to fence the prctl (for example, fail setting the
prctl after gmap_mark_unmergeable() ran, and make
gmap_mark_unmergeable() fail if the prctl ran -- or handle it gracefully
in some other way).
>
> Link: https://lkml.kernel.org/r/20230227220206.436662-1-shr@devkernel.io [1]
> Link: https://lkml.kernel.org/r/20230224044000.3084046-1-shr@devkernel.io
> Link: https://lkml.kernel.org/r/20230224044000.3084046-2-shr@devkernel.io
> Signed-off-by: Stefan Roesch <shr@devkernel.io>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Rik van Riel <riel@surriel.com>
> Cc: Bagas Sanjaya <bagasdotme@gmail.com>
> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> ---
> include/linux/ksm.h | 14 ++++--
> include/linux/sched/coredump.h | 1 +
> include/uapi/linux/prctl.h | 2 +
> kernel/sys.c | 27 ++++++++++
> mm/ksm.c | 90 +++++++++++++++++++++++-----------
> 5 files changed, 101 insertions(+), 33 deletions(-)
>
> diff --git a/include/linux/ksm.h b/include/linux/ksm.h
> index 7e232ba59b86..d38a05a36298 100644
> --- a/include/linux/ksm.h
> +++ b/include/linux/ksm.h
> @@ -18,20 +18,24 @@
> #ifdef CONFIG_KSM
> int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
> unsigned long end, int advice, unsigned long *vm_flags);
> -int __ksm_enter(struct mm_struct *mm);
> -void __ksm_exit(struct mm_struct *mm);
> +int __ksm_enter(struct mm_struct *mm, int flag);
> +void __ksm_exit(struct mm_struct *mm, int flag);
>
> static inline int ksm_fork(struct mm_struct *mm, struct mm_struct *oldmm)
> {
> + if (test_bit(MMF_VM_MERGE_ANY, &oldmm->flags))
> + return __ksm_enter(mm, MMF_VM_MERGE_ANY);
> if (test_bit(MMF_VM_MERGEABLE, &oldmm->flags))
> - return __ksm_enter(mm);
> + return __ksm_enter(mm, MMF_VM_MERGEABLE);
> return 0;
> }
>
> static inline void ksm_exit(struct mm_struct *mm)
> {
> - if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> - __ksm_exit(mm);
> + if (test_bit(MMF_VM_MERGE_ANY, &mm->flags))
> + __ksm_exit(mm, MMF_VM_MERGE_ANY);
> + else if (test_bit(MMF_VM_MERGEABLE, &mm->flags))
> + __ksm_exit(mm, MMF_VM_MERGEABLE);
> }
>
> /*
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 0e17ae7fbfd3..0ee96ea7a0e9 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -90,4 +90,5 @@ static inline int get_dumpable(struct mm_struct *mm)
> #define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK |\
> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK)
>
> +#define MMF_VM_MERGE_ANY 29
> #endif /* _LINUX_SCHED_COREDUMP_H */
> diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> index 1312a137f7fb..759b3f53e53f 100644
> --- a/include/uapi/linux/prctl.h
> +++ b/include/uapi/linux/prctl.h
> @@ -290,4 +290,6 @@ struct prctl_mm_map {
> #define PR_SET_VMA 0x53564d41
> # define PR_SET_VMA_ANON_NAME 0
>
> +#define PR_SET_MEMORY_MERGE 67
> +#define PR_GET_MEMORY_MERGE 68
> #endif /* _LINUX_PRCTL_H */
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 495cd87d9bf4..edc439b1cae9 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -15,6 +15,7 @@
> #include <linux/highuid.h>
> #include <linux/fs.h>
> #include <linux/kmod.h>
> +#include <linux/ksm.h>
> #include <linux/perf_event.h>
> #include <linux/resource.h>
> #include <linux/kernel.h>
> @@ -2661,6 +2662,32 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_SET_VMA:
> error = prctl_set_vma(arg2, arg3, arg4, arg5);
> break;
> +#ifdef CONFIG_KSM
> + case PR_SET_MEMORY_MERGE:
> + if (!capable(CAP_SYS_RESOURCE))
> + return -EPERM;
> +
> + if (arg2) {
> + if (mmap_write_lock_killable(me->mm))
> + return -EINTR;
> +
> + if (!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags))
> + error = __ksm_enter(me->mm, MMF_VM_MERGE_ANY);
Hm, I think this might be problematic if we alread called __ksm_enter()
via madvise(). Maybe we should really consider making MMF_VM_MERGE_ANY
set MMF_VM_MERGABLE instead. Like:
error = 0;
if(test_bit(MMF_VM_MERGEABLE, &me->mm->flags))
error = __ksm_enter(me->mm);
if (!error)
set_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
> + mmap_write_unlock(me->mm);
> + } else {
> + __ksm_exit(me->mm, MMF_VM_MERGE_ANY);
Hm, I'd prefer if we really only call __ksm_exit() when we really exit
the process. Is there a strong requirement to optimize disabling of KSM
or would it be sufficient to clear the MMF_VM_MERGE_ANY flag here?
Also, I wonder what happens if we have another VMA in that process that
has it enabled ..
Last but not least, wouldn't we want to do the same thing as
MADV_UNMERGEABLE and actually unmerge the KSM pages?
It smells like it could be simpler and more consistent to handle by
letting PR_SET_MEMORY_MERGE piggy-back on MMF_VM_MERGABLE/VM_MERGABLE
and mimic what ksm_madvise() does simply for all VMAs.
> --- a/mm/ksm.c
> +++ b/mm/ksm.c
> @@ -534,16 +534,58 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr,
> return (ret & VM_FAULT_OOM) ? -ENOMEM : 0;
> }
>
> +static bool vma_ksm_compatible(struct vm_area_struct *vma)
> +{
> + /*
> + * Be somewhat over-protective for now!
> + */
> + if (vma->vm_flags & (VM_MERGEABLE | VM_SHARED | VM_MAYSHARE |
> + VM_PFNMAP | VM_IO | VM_DONTEXPAND |
> + VM_HUGETLB | VM_MIXEDMAP))
> + return false; /* just ignore the advice */
That comment is kind-of stale and ksm_madvise() specific.
> +
The VM_MERGEABLE check is really only used for ksm_madvise() to return
immediately. I suggest keeping it in ksm_madvise() -- "Already enabled".
Returning "false" in that case looks wrong (it's not broken because you
do an early check in vma_ksm_mergeable(), it's just semantically weird).
--
Thanks,
David / dhildenb
next prev parent reply other threads:[~2023-04-03 10:38 UTC|newest]
Thread overview: 36+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-03-10 18:28 [PATCH v4 0/3] mm: process/cgroup ksm support Stefan Roesch
2023-03-10 18:28 ` [PATCH v4 1/3] mm: add new api to enable ksm per process Stefan Roesch
2023-03-13 16:26 ` Johannes Weiner
2023-04-03 10:37 ` David Hildenbrand [this message]
2023-04-03 11:03 ` David Hildenbrand
2023-04-04 16:32 ` Stefan Roesch
2023-04-04 16:43 ` Stefan Roesch
2023-04-05 6:51 ` Christian Borntraeger
2023-04-05 16:04 ` David Hildenbrand
2023-04-03 15:50 ` Stefan Roesch
2023-04-03 17:02 ` David Hildenbrand
2023-03-10 18:28 ` [PATCH v4 2/3] mm: add new KSM process and sysfs knobs Stefan Roesch
2023-04-05 17:04 ` David Hildenbrand
2023-04-05 21:20 ` Stefan Roesch
2023-04-06 13:23 ` David Hildenbrand
2023-04-06 14:16 ` Johannes Weiner
2023-04-06 14:32 ` David Hildenbrand
2023-03-10 18:28 ` [PATCH v4 3/3] selftests/mm: add new selftests for KSM Stefan Roesch
2023-03-15 20:03 ` [PATCH v4 0/3] mm: process/cgroup ksm support David Hildenbrand
2023-03-15 20:23 ` Mike Kravetz
2023-03-15 21:05 ` Johannes Weiner
2023-03-15 21:19 ` Johannes Weiner
2023-03-15 21:45 ` David Hildenbrand
2023-03-15 21:47 ` David Hildenbrand
2023-03-30 16:19 ` Stefan Roesch
2023-03-28 23:09 ` Andrew Morton
2023-03-30 4:55 ` David Hildenbrand
2023-03-30 14:26 ` Johannes Weiner
2023-03-30 14:40 ` David Hildenbrand
2023-03-30 16:41 ` Stefan Roesch
2023-04-03 9:48 ` David Hildenbrand
2023-04-03 16:34 ` Stefan Roesch
2023-04-03 17:04 ` David Hildenbrand
2023-04-06 16:59 ` Stefan Roesch
2023-04-06 17:10 ` David Hildenbrand
2023-03-30 20:18 ` Andrew Morton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7ed4308d-b400-d2bb-b539-3fe418862ab8@redhat.com \
--to=david@redhat.com \
--cc=akpm@linux-foundation.org \
--cc=bagasdotme@gmail.com \
--cc=hannes@cmpxchg.org \
--cc=kernel-team@fb.com \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=mhocko@suse.com \
--cc=riel@surriel.com \
--cc=shr@devkernel.io \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox