From: Stefan Roesch
To: David Hildenbrand
Cc: kernel-team@fb.com, linux-mm@kvack.org, riel@surriel.com, mhocko@suse.com, linux-kselftest@vger.kernel.org, linux-doc@vger.kernel.org, akpm@linux-foundation.org, hannes@cmpxchg.org, Bagas Sanjaya, Janosch Frank, Christian Borntraeger
Subject: Re: [PATCH v4 1/3] mm: add new api to enable ksm per process
Date: Tue, 04 Apr 2023 09:32:31 -0700
References: <20230310182851.2579138-1-shr@devkernel.io> <20230310182851.2579138-2-shr@devkernel.io> <7ed4308d-b400-d2bb-b539-3fe418862ab8@redhat.com>

David Hildenbrand writes:

> On 03.04.23 12:37, David Hildenbrand wrote:
>> On 10.03.23 19:28, Stefan Roesch wrote:
>>> Patch series "mm: process/cgroup ksm support", v3.
>>>
>>> So far KSM can only be enabled by calling madvise for memory regions. To
>>> be able to use KSM for more workloads, KSM needs to have the ability to be
>>> enabled / disabled at the process / cgroup level.
>>>
>>> Use case 1:
>>>
>>> The madvise call is not available in the programming language.
>>> An example of this is programs with forked workloads using a garbage
>>> collected language without pointers. In such a language madvise cannot
>>> be made available.
>>>
>>> In addition the addresses of objects get moved around as they are
>>> garbage collected. KSM sharing needs to be enabled "from the outside"
>>> for these types of workloads.
>> I guess the interpreter could enable it (like a memory allocator could
>> enable it for the whole heap). But I get that it's much easier to enable
>> this per-process, and eventually only when a lot of the same processes
>> are running in that particular environment.
>>
>>>
>>> Use case 2:
>>>
>>> The same interpreter can also be used for workloads where KSM brings
>>> no benefit or even has overhead. We'd like to be able to enable KSM on
>>> a workload by workload basis.
>> Agreed. A per-process control is also helpful to identify workloads
>> where KSM might be beneficial (and to which degree).
>>
>>>
>>> Use case 3:
>>>
>>> With the madvise call sharing opportunities are only enabled for the
>>> current process: it is a workload-local decision. A considerable number
>>> of sharing opportunities may exist across multiple workloads or jobs.
>>> Only a higher-level entity like a job scheduler or container can know
>>> for certain if it's running one or more instances of a job. That job
>>> scheduler however doesn't have the necessary internal workload knowledge
>>> to make targeted madvise calls.
>>>
>>> Security concerns:
>>>
>>> In previous discussions security concerns have been brought up. The
>>> problem is that an individual workload does not have the knowledge about
>>> what else is running on a machine. Therefore it has to be very
>>> conservative in what memory areas can be shared or not. However, if the
>>> system is dedicated to running multiple jobs within the same security
>>> domain, it's the job scheduler that has the knowledge that sharing can be
>>> safely enabled and is even desirable.
>>>
>>> Performance:
>>>
>>> Experiments with using UKSM have shown a capacity increase of around
>>> 20%.
>>>
>> As raised, it would be great to include more details about the workload
>> where this particularly helps (e.g., a lot of Django processes operating
>> in the same domain).
>>
>>>
>>> 1. New options for prctl system command
>>>
>>> This patch series adds two new options to the prctl system call.
>>> The first one allows to enable KSM at the process level and the second
>>> one to query the setting.
>>>
>>> The setting will be inherited by child processes.
>>>
>>> With the above setting, KSM can be enabled for the seed process of a
>>> cgroup and all processes in the cgroup will inherit the setting.
>>>
>>> 2. Changes to KSM processing
>>>
>>> When KSM is enabled at the process level, the KSM code will iterate
>>> over all the VMA's and enable KSM for the eligible VMA's.
>>>
>>> When forking a process that has KSM enabled, the setting will be
>>> inherited by the new child process.
>>>
>>> In addition when KSM is disabled for a process, KSM will be disabled
>>> for the VMA's where KSM has been enabled.
>> Do we want to make MADV_MERGEABLE/MADV_UNMERGEABLE fail while the new
>> prctl is enabled for a process?
>>
>>>
>>> 3. Add general_profit metric
>>>
>>> The general_profit metric of KSM is specified in the documentation,
>>> but not calculated. This adds the general profit metric to
>>> /sys/kernel/debug/mm/ksm.
>>>
>>> 4. Add more metrics to ksm_stat
>>>
>>> This adds the process profit and ksm type metric to
>>> /proc//ksm_stat.
>>>
>>> 5. Add more tests to ksm_tests
>>>
>>> This adds an option to specify the merge type to the ksm_tests.
>>> This allows to test madvise and prctl KSM. It also adds a new option
>>> to query if prctl KSM has been enabled. It adds a fork test to verify
>>> that the KSM process setting is inherited by child processes.
>>>
>>> An update to the prctl(2) manpage has been proposed at [1].
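
For illustration only, here is a minimal sketch of how a launcher process might use the two prctl options described in item 1 above. The option names and numbers (PR_SET_MEMORY_MERGE = 67, PR_GET_MEMORY_MERGE = 68) are the ones found in later mainline kernels; they are an assumption relative to this posting, which may name them differently. ksm_enable_for_process() is a hypothetical helper, not part of the series.

```c
#include <sys/prctl.h>

/* Fallback definitions for older userspace headers. These values match
 * later mainline kernels and are an assumption for this posting. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Hypothetical helper: ask the kernel to enable KSM for every eligible
 * VMA of the calling process, then read the setting back.
 *
 * Returns 1 if the setting is reported as enabled, 0 if it is reported
 * as disabled, and -1 if the kernel rejects the request (for example a
 * kernel without this prctl or without KSM support).
 */
int ksm_enable_for_process(void)
{
	/* Unused arguments are passed as zero. */
	if (prctl(PR_SET_MEMORY_MERGE, 1UL, 0UL, 0UL, 0UL) != 0)
		return -1;

	/* The query option reports the current per-process setting. */
	return prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);
}
```

A seed process would call this once before fork()/exec(), so that the children inherit the setting as described above.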
>>>
>>> This patch (of 3):
>>>
>>> This adds a new prctl API to enable and disable KSM on a per process
>>> basis instead of only at the VMA basis (with madvise).
>>>
>>> 1) Introduce new MMF_VM_MERGE_ANY flag
>>>
>>> This introduces the new MMF_VM_MERGE_ANY flag. When this flag
>>> is set, kernel samepage merging (ksm) gets enabled for all vma's of a
>>> process.
>>>
>>> 2) add flag to __ksm_enter
>>>
>>> This change adds the flag parameter to __ksm_enter. This allows to
>>> distinguish if ksm was called by prctl or madvise.
>>>
>>> 3) add flag to __ksm_exit call
>>>
>>> This adds the flag parameter to the __ksm_exit() call. This allows
>>> to distinguish if this call is for a prctl or madvise invocation.
>>>
>>> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>>>
>>> If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
>>> over all the vmas and enable ksm if possible. For the vmas that can be
>>> ksm enabled this is only done once.
>>>
>>> 5) support disabling of ksm for a process
>>>
>>> This adds the ability to disable ksm for a process if ksm has been
>>> enabled for the process.
>>>
>>> 6) add new prctl option to get and set ksm for a process
>>>
>>> This adds two new options to the prctl system call
>>> - enable ksm for all vmas of a process (if the vmas support it).
>>> - query if ksm has been enabled for a process.
>> Did you consider, instead of handling MMF_VM_MERGE_ANY in a special way,
>> to instead make it reuse the existing MMF_VM_MERGEABLE/VM_MERGEABLE
>> infrastructure? Especially:
>> 1) During prctl(MMF_VM_MERGE_ANY), set VM_MERGEABLE on all applicable
>> compatible VMAs. Further, set MMF_VM_MERGEABLE and enter KSM if not
>> already set.
>> 2) When creating a new, compatible VMA and MMF_VM_MERGE_ANY is set, set
>> VM_MERGEABLE?
>> Then you can avoid all runtime checks for compatible VMAs and only look
>> at the VM_MERGEABLE flag. In fact, the VM_MERGEABLE flag will be completely
>> expressive then for all VMAs.
>> You don't need vma_ksm_mergeable() then.
>> Another thing to consider is interaction with arch/s390/mm/gmap.c:
>> s390x/kvm does not support KSM and it has to disable it for all VMAs. We
>> have to find a way to fence the prctl (for example, fail setting the
>> prctl after gmap_mark_unmergeable() ran, and make
>> gmap_mark_unmergeable() fail if the prctl ran -- or handle it gracefully
>> in some other way).

gmap_mark_unmergeable() seems to have a problem today. We can execute
gmap_mark_unmergeable() and mark the vma's as unmergeable, but shortly
after that the process can run madvise on them again and make them
mergeable. Am I missing something here?

Once prctl is run, we can check for the MMF_VM_MERGE_ANY flag in
gmap_mark_unmergeable(). In case it is set, we can return an error. The
error code path looks like it can handle that case.

For the opposite case, where gmap_mark_unmergeable() has already been
run, we would need some kind of flag or other means to be able to detect
it. Any recommendations?

>
> Staring at that code, I wonder if the "mm->def_flags &= ~VM_MERGEABLE" is doing
> what it's supposed to do. I don't think this effectively prevents right now
> madvise() from getting re-enabled on that VMA.
>
> @Christian, Janosch, am I missing something?
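
To make the fencing discussed above concrete, here is a small userspace model of the two-way check. It is only a sketch of the idea, not kernel code: struct mm_model, the MODEL_* bit values, the "forbidden" flag and the choice of -EINVAL are all assumptions made for illustration. Only the MMF_VM_MERGE_ANY name and gmap_mark_unmergeable() come from the thread.

```c
#include <errno.h>

/* Placeholder bits for this model only; not the kernel's values. */
#define MODEL_MMF_VM_MERGE_ANY   (1UL << 0) /* set by the new prctl */
#define MODEL_MMF_KSM_FORBIDDEN  (1UL << 1) /* hypothetical: gmap_mark_unmergeable() ran */

/* Stand-in for the per-process mm flags. */
struct mm_model {
	unsigned long flags;
};

/* Models the prctl: refuse once KSM has been forbidden for this mm. */
int model_prctl_merge_any(struct mm_model *mm)
{
	if (mm->flags & MODEL_MMF_KSM_FORBIDDEN)
		return -EINVAL;
	mm->flags |= MODEL_MMF_VM_MERGE_ANY;
	return 0;
}

/* Models gmap_mark_unmergeable(): fail if the prctl already ran,
 * otherwise record that KSM must stay off for this mm. */
int model_gmap_mark_unmergeable(struct mm_model *mm)
{
	if (mm->flags & MODEL_MMF_VM_MERGE_ANY)
		return -EINVAL;
	mm->flags |= MODEL_MMF_KSM_FORBIDDEN;
	return 0;
}

/* Models madvise(MADV_MERGEABLE): the same hypothetical flag would also
 * close the "madvise re-enables it afterwards" gap raised above. */
int model_madvise_mergeable(const struct mm_model *mm)
{
	if (mm->flags & MODEL_MMF_KSM_FORBIDDEN)
		return -EINVAL;
	return 0;
}
```

In this model whichever of the two operations runs first wins, and the other one fails. The real code would have to decide on locking and on how the error is reported to userspace.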