From mboxrd@z Thu Jan  1 00:00:00 1970
References: <20230310182851.2579138-1-shr@devkernel.io>
 <20230310182851.2579138-2-shr@devkernel.io>
 <7ed4308d-b400-d2bb-b539-3fe418862ab8@redhat.com>
 <e888871b-9f48-c01d-ce7f-f32ec3d79ef8@redhat.com>
User-agent: mu4e 1.6.11; emacs 28.2.50
From: Stefan Roesch <shr@devkernel.io>
To: David Hildenbrand <david@redhat.com>
Cc: kernel-team@fb.com, linux-mm@kvack.org, riel@surriel.com,
 mhocko@suse.com, linux-kselftest@vger.kernel.org,
 linux-doc@vger.kernel.org, akpm@linux-foundation.org, hannes@cmpxchg.org,
 Bagas Sanjaya <bagasdotme@gmail.com>, Janosch Frank
 <frankja@linux.ibm.com>, Christian Borntraeger <borntraeger@de.ibm.com>
Subject: Re: [PATCH v4 1/3] mm: add new api to enable ksm per process
Date: Tue, 04 Apr 2023 09:43:51 -0700
In-reply-to: <e888871b-9f48-c01d-ce7f-f32ec3d79ef8@redhat.com>
Message-ID: <qvqwzg7nha21.fsf@dev0134.prn3.facebook.com>
MIME-Version: 1.0
Content-Type: text/plain
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>


David Hildenbrand <david@redhat.com> writes:

> On 03.04.23 12:37, David Hildenbrand wrote:
>> On 10.03.23 19:28, Stefan Roesch wrote:
>>> Patch series "mm: process/cgroup ksm support", v3.
>>>
>>> So far KSM can only be enabled by calling madvise for memory regions.  To
>>> be able to use KSM for more workloads, KSM needs to have the ability to be
>>> enabled / disabled at the process / cgroup level.
>>>
>>> Use case 1:
>>>
>>>     The madvise call is not available in the programming language.  An
>>>     example for this are programs with forked workloads using a garbage
>>>     collected language without pointers.  In such a language madvise cannot
>>>     be made available.
>>>
>>>     In addition the addresses of objects get moved around as they are
>>>     garbage collected.  KSM sharing needs to be enabled "from the outside"
>>>     for these types of workloads.
>> I guess the interpreter could enable it (like a memory allocator could
>> enable it for the whole heap). But I get that it's much easier to enable
>> this per-process, and eventually only when a lot of the same processes
>> are running in that particular environment.
>>
>>>
>>> Use case 2:
>>>
>>>     The same interpreter can also be used for workloads where KSM brings
>>>     no benefit or even has overhead.  We'd like to be able to enable KSM on
>>>     a workload by workload basis.
>> Agreed. A per-process control is also helpful to identify workloads
>> where KSM might be beneficial (and to which degree).
>>
>>>
>>> Use case 3:
>>>
>>>     With the madvise call, sharing opportunities are only enabled for the
>>>     current process: it is a workload-local decision.  A considerable number
>>>     of sharing opportunities may exist across multiple workloads or jobs.
>>>     Only a higher level entity like a job scheduler or container can know
>>>     for certain if it's running one or more instances of a job.  That job
>>>     scheduler however doesn't have the necessary internal workload knowledge
>>>     to make targeted madvise calls.
>>>
>>> Security concerns:
>>>
>>>     In previous discussions security concerns have been brought up.  The
>>>     problem is that an individual workload does not have the knowledge about
>>>     what else is running on a machine.  Therefore it has to be very
>>>     conservative in what memory areas can be shared or not.  However, if the
>>>     system is dedicated to running multiple jobs within the same security
>>>     domain, it's the job scheduler that has the knowledge that sharing can be
>>>     safely enabled and is even desirable.
>>>
>>> Performance:
>>>
>>>     Experiments with using UKSM have shown a capacity increase of around
>>>     20%.
>>>
>> As raised, it would be great to include more details about the workload
>> where this particularly helps (e.g., a lot of Django processes operating
>> in the same domain).
>>
>>>
>>> 1. New options for prctl system call
>>>
>>>      This patch series adds two new options to the prctl system call.
>>>      The first one enables KSM at the process level and the second
>>>      one queries the setting.
>>>
>>>      The setting will be inherited by child processes.
>>>
>>>      With the above setting, KSM can be enabled for the seed process of a
>>>      cgroup and all processes in the cgroup will inherit the setting.
>>>
>>> 2. Changes to KSM processing
>>>
>>>      When KSM is enabled at the process level, the KSM code will iterate
>>>      over all the VMA's and enable KSM for the eligible VMA's.
>>>
>>>      When forking a process that has KSM enabled, the setting will be
>>>      inherited by the new child process.
>>>
>>>      In addition when KSM is disabled for a process, KSM will be disabled
>>>      for the VMA's where KSM has been enabled.
>> Do we want to make MADV_MERGEABLE/MADV_UNMERGEABLE fail while the new
>> prctl is enabled for a process?
>>
>>>
>>> 3. Add general_profit metric
>>>
>>>      The general_profit metric of KSM is specified in the documentation,
>>>      but not calculated.  This adds the general profit metric to
>>>      /sys/kernel/debug/mm/ksm.
>>>
>>> 4. Add more metrics to ksm_stat
>>>
>>>      This adds the process profit and ksm type metric to
>>>      /proc/<pid>/ksm_stat.
>>>
>>> 5. Add more tests to ksm_tests
>>>
>>>      This adds an option to specify the merge type to the ksm_tests.
>>>      This allows testing madvise and prctl KSM.  It also adds a new option
>>>      to query if prctl KSM has been enabled.  It adds a fork test to verify
>>>      that the KSM process setting is inherited by child processes.
>>>
>>> An update to the prctl(2) manpage has been proposed at [1].
>>>
>>> This patch (of 3):
>>>
>>> This adds a new prctl API to enable and disable KSM on a per process
>>> basis instead of only on a VMA basis (with madvise).
>>>
>>> 1) Introduce new MMF_VM_MERGE_ANY flag
>>>
>>>      This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
>>>      is set, kernel samepage merging (ksm) gets enabled for all vma's of a
>>>      process.
>>>
>>> 2) add flag to __ksm_enter
>>>
>>>      This change adds the flag parameter to __ksm_enter.  This makes it
>>>      possible to distinguish whether ksm was called by prctl or madvise.
>>>
>>> 3) add flag to __ksm_exit call
>>>
>>>      This adds the flag parameter to the __ksm_exit() call.  This makes it
>>>      possible to distinguish whether this call is for a prctl or madvise
>>>      invocation.
>>>
>>> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>>>
>>>      If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
>>>      over all the vmas and enable ksm if possible.  For the vmas that can be
>>>      ksm enabled this is only done once.
>>>
>>> 5) support disabling of ksm for a process
>>>
>>>      This adds the ability to disable ksm for a process if ksm has been
>>>      enabled for the process.
>>>
>>> 6) add new prctl option to get and set ksm for a process
>>>
>>>      This adds two new options to the prctl system call
>>>      - enable ksm for all vmas of a process (if the vmas support it).
>>>      - query if ksm has been enabled for a process.
>> Did you consider, instead of handling MMF_VM_MERGE_ANY in a special way,
>> making it reuse the existing MMF_VM_MERGEABLE/VM_MERGEABLE
>> infrastructure? Especially:
>> 1) During prctl(MMF_VM_MERGE_ANY), set VM_MERGEABLE on all applicable
>>      compatible VMAs. Further, set MMF_VM_MERGEABLE and enter KSM if not
>>      already set.
>> 2) When creating a new, compatible VMA and MMF_VM_MERGE_ANY is set, set
>>      VM_MERGEABLE?
>> Then you can avoid all runtime checks for compatible VMAs and only look
>> at the VM_MERGEABLE flag. In fact, the VM_MERGEABLE flag will be completely
>> expressive then for all VMAs. You don't need vma_ksm_mergeable() then.
>> Another thing to consider is interaction with arch/s390/mm/gmap.c:
>> s390x/kvm does not support KSM and it has to disable it for all VMAs. We
>> have to find a way to fence the prctl (for example, fail setting the
>> prctl after gmap_mark_unmergeable() ran, and make
>> gmap_mark_unmergeable() fail if the prctl ran -- or handle it gracefully
>> in some other way).

gmap_mark_unmergeable() seems to have a problem today. We can execute
gmap_mark_unmergeable() and mark the vma's as unmergeable, but shortly
after that the process can run madvise on it again and make it
mergeable. Am I missing something here?
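
To make the concern concrete, here is a small userspace toy model of
the sequence described above. It is only a sketch, not kernel code:
the toy_* names, the flag value and the fixed-size VMA array are all
made up for illustration, and the real ksm_madvise() path may perform
checks that this model leaves out.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for VM_MERGEABLE; the value is made up. */
#define TOY_VM_MERGEABLE 0x1UL
#define TOY_NR_VMAS 4

struct toy_vma {
	unsigned long vm_flags;
};

struct toy_mm {
	unsigned long def_flags;
	struct toy_vma vmas[TOY_NR_VMAS];
};

/*
 * Models gmap_mark_unmergeable() as discussed in this thread: clear the
 * mergeable bit on every VMA, then clear it in the default flags.
 */
static void toy_mark_unmergeable(struct toy_mm *mm)
{
	assert(mm != NULL);
	for (size_t i = 0; i < TOY_NR_VMAS; i++)
		mm->vmas[i].vm_flags &= ~TOY_VM_MERGEABLE;
	mm->def_flags &= ~TOY_VM_MERGEABLE;
}

/*
 * Models madvise(MADV_MERGEABLE) on a single VMA. Nothing in the mm
 * records that merging was forbidden, so the bit is simply set again.
 */
static void toy_madvise_mergeable(struct toy_mm *mm, size_t idx)
{
	assert(mm != NULL && idx < TOY_NR_VMAS);
	mm->vmas[idx].vm_flags |= TOY_VM_MERGEABLE;
}

/* Returns 1 if any VMA is currently marked mergeable, 0 otherwise. */
static int toy_any_mergeable(const struct toy_mm *mm)
{
	for (size_t i = 0; i < TOY_NR_VMAS; i++)
		if (mm->vmas[i].vm_flags & TOY_VM_MERGEABLE)
			return 1;
	return 0;
}
```

In this model, calling toy_mark_unmergeable() followed by
toy_madvise_mergeable() leaves a mergeable VMA behind, which is the
behaviour being questioned here.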

Once prctl is run, we can check for the MMF_VM_MERGE_ANY flag in
gmap_mark_unmergeable(). In case it is set, we can return an error. The
error code path looks like it can handle that case.

For the opposite case, where gmap_mark_unmergeable() has already been run,
we would need some kind of flag or other means to be able to detect it.
Any recommendations?
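
One possible shape for fencing in both directions, again as a
userspace toy model rather than a patch. MMF_VM_MERGE_ANY is the flag
from this series; the second flag (FENCE_MMF_KSM_FORBIDDEN), the bit
positions, the fence_* function names and the choice of -EINVAL are
all hypothetical and only stand in for "some kind of flag".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Flag name from the series; the bit position here is made up. */
#define FENCE_MMF_VM_MERGE_ANY   (1UL << 0)
/* Hypothetical flag: set once the "mark unmergeable" step has run. */
#define FENCE_MMF_KSM_FORBIDDEN  (1UL << 1)

struct fence_mm {
	unsigned long flags;
};

/* Models the prctl enable path: refuse if KSM was forbidden earlier. */
static int fence_prctl_enable_ksm(struct fence_mm *mm)
{
	assert(mm != NULL);
	if (mm->flags & FENCE_MMF_KSM_FORBIDDEN)
		return -EINVAL;
	mm->flags |= FENCE_MMF_VM_MERGE_ANY;
	return 0;
}

/*
 * Models gmap_mark_unmergeable(): refuse if the prctl already enabled
 * KSM for the process, otherwise remember that KSM is now forbidden.
 */
static int fence_mark_unmergeable(struct fence_mm *mm)
{
	assert(mm != NULL);
	if (mm->flags & FENCE_MMF_VM_MERGE_ANY)
		return -EINVAL;
	mm->flags |= FENCE_MMF_KSM_FORBIDDEN;
	return 0;
}
```

Whichever of the two calls runs first wins and the other one fails. A
real implementation would also have to decide what madvise() does once
the "forbidden" state is set, which this sketch does not cover.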

>
>
> Staring at that code, I wonder if the "mm->def_flags &= ~VM_MERGEABLE" is doing
> what it's supposed to do. I don't think it currently prevents KSM from being
> re-enabled on that VMA via madvise().
>
> @Christian, Janosch, am I missing something?
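
For reference, the enable/query pair described in the cover letter
could be exercised from userspace roughly as follows. This is a sketch
under assumptions: the thread does not name the prctl options, so the
names and values below (PR_SET_MEMORY_MERGE = 67, PR_GET_MEMORY_MERGE
= 68) are the ones found in later mainline kernels and may differ from
what this v4 patch defines. The helper names are made up.

```c
#include <errno.h>
#include <sys/prctl.h>

/*
 * Assumption: option names and values as in later mainline kernels;
 * older headers may not define them, and this v4 patch may differ.
 */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Enable (enable != 0) or disable (enable == 0) KSM for the calling
 * process. Returns 0 on success or a negative errno value, for example
 * -EINVAL on kernels that do not know the option.
 */
static int ksm_enable_for_process(int enable)
{
	if (prctl(PR_SET_MEMORY_MERGE, enable ? 1UL : 0UL, 0UL, 0UL, 0UL) == -1)
		return -errno;
	return 0;
}

/*
 * Query the per-process setting. Returns 1 if enabled, 0 if not, or a
 * negative errno value on failure.
 */
static int ksm_query_for_process(void)
{
	int ret = prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);

	return ret == -1 ? -errno : ret;
}
```

A job launcher could call ksm_enable_for_process(1) before forking its
workers, relying on the inheritance by child processes that the cover
letter describes.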