From: Stefan Roesch
To: Johannes Weiner
Cc: kernel-team@fb.com, linux-mm@kvack.org, riel@surriel.com, mhocko@suse.com,
	david@redhat.com, linux-kselftest@vger.kernel.org,
	linux-doc@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: [PATCH v3 1/3] mm: add new api to enable ksm per process
Date: Wed, 08 Mar 2023 14:16:36 -0800
In-reply-to: <20230308164746.GA473363@cmpxchg.org>
References: <20230224044000.3084046-1-shr@devkernel.io>
	<20230224044000.3084046-2-shr@devkernel.io>
	<20230308164746.GA473363@cmpxchg.org>

Johannes Weiner writes:

> On Thu, Feb 23, 2023 at 08:39:58PM -0800, Stefan Roesch wrote:
>> This adds a new prctl API to enable and disable KSM on a per process
>> basis instead of only at the VMA basis (with madvise).
>>
>> 1) Introduce new MMF_VM_MERGE_ANY flag
>>
>> This introduces the new MMF_VM_MERGE_ANY flag. When this flag is
>> set, kernel samepage merging (ksm) gets enabled for all vmas of a
>> process.
>>
>> 2) add flag to __ksm_enter
>>
>> This change adds the flag parameter to __ksm_enter. This makes it
>> possible to distinguish whether ksm was called by prctl or madvise.
>>
>> 3) add flag to __ksm_exit call
>>
>> This adds the flag parameter to the __ksm_exit() call. This makes it
>> possible to distinguish whether this call is for a prctl or madvise
>> invocation.
>>
>> 4) invoke madvise for all vmas in scan_get_next_rmap_item
>>
>> If the new flag MMF_VM_MERGE_ANY has been set for a process, iterate
>> over all the vmas and enable ksm if possible. For the vmas that can be
>> ksm enabled this is only done once.
>>
>> 5) support disabling of ksm for a process
>>
>> This adds the ability to disable ksm for a process if ksm has been
>> enabled for the process.
>>
>> 6) add new prctl option to get and set ksm for a process
>>
>> This adds two new options to the prctl system call
>> - enable ksm for all vmas of a process (if the vmas support it).
>> - query if ksm has been enabled for a process.
>>
>> Signed-off-by: Stefan Roesch
>
> Hey Stefan, thanks for merging the patches into one. I found it much
> easier to review.
>
> Overall this looks straight-forward to me. A few comments below:
>
>> @@ -2659,6 +2660,34 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
>>  	case PR_SET_VMA:
>>  		error = prctl_set_vma(arg2, arg3, arg4, arg5);
>>  		break;
>> +#ifdef CONFIG_KSM
>> +	case PR_SET_MEMORY_MERGE:
>> +		if (!capable(CAP_SYS_RESOURCE))
>> +			return -EPERM;
>> +
>> +		if (arg2) {
>> +			if (mmap_write_lock_killable(me->mm))
>> +				return -EINTR;
>> +
>> +			if (test_bit(MMF_VM_MERGEABLE, &me->mm->flags))
>> +				error = -EINVAL;
>
> So if the workload has already madvised specific VMAs the
> process-enablement will fail. Why is that? Shouldn't it be possible to
> override a local decision from an outside context that has more
> perspective on both sharing opportunities and security aspects?
>
> If there is a good reason for it, the -EINVAL should be addressed in
> the manpage. And maybe add a comment here as well.
>

This makes sense. I'll remove the check above.

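For context, the two prctl options described in item 6 of the commit message could be exercised from userspace roughly as sketched below. This is only an illustration under stated assumptions: the option numbers 67 and 68 are the values found in mainline <linux/prctl.h> and may not match this v3 posting, and the helper names ksm_set_process_merge(), ksm_get_process_merge() and ksm_report() are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/prctl.h>

/*
 * Assumption: option numbers as found in mainline <linux/prctl.h>.
 * The v3 patch under review may have used different values.
 */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/* Hypothetical helper: enable (1) or disable (0) KSM for the calling process. */
static int ksm_set_process_merge(int enable)
{
	assert(enable == 0 || enable == 1);
	return prctl(PR_SET_MEMORY_MERGE, (unsigned long)enable, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: 1 if enabled, 0 if not, -1 with errno set on error. */
static int ksm_get_process_merge(void)
{
	/* The quoted hunk rejects non-zero arg2..arg5 with -EINVAL. */
	return prctl(PR_GET_MEMORY_MERGE, 0UL, 0UL, 0UL, 0UL);
}

/* Hypothetical helper: print the current state, or why it cannot be read. */
static void ksm_report(void)
{
	int state = ksm_get_process_merge();

	if (state < 0)
		printf("PR_GET_MEMORY_MERGE failed: %s\n", strerror(errno));
	else
		printf("process-wide KSM is %s\n", state ? "enabled" : "disabled");
}
```

A caller would typically invoke ksm_set_process_merge(1) early in main() and then ksm_report(). With the hunk quoted above, both options return -EPERM without CAP_SYS_RESOURCE, and a kernel built without CONFIG_KSM falls through to the default case and returns -EINVAL.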
>> +			else if (!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags))
>> +				error = __ksm_enter(me->mm, MMF_VM_MERGE_ANY);
>> +			mmap_write_unlock(me->mm);
>> +		} else {
>> +			__ksm_exit(me->mm, MMF_VM_MERGE_ANY);
>> +		}
>> +		break;
>> +	case PR_GET_MEMORY_MERGE:
>> +		if (!capable(CAP_SYS_RESOURCE))
>> +			return -EPERM;
>> +
>> +		if (arg2 || arg3 || arg4 || arg5)
>> +			return -EINVAL;
>> +
>> +		error = !!test_bit(MMF_VM_MERGE_ANY, &me->mm->flags);
>> +		break;
>> +#endif
>>  	default:
>>  		error = -EINVAL;
>>  		break;
>> diff --git a/mm/ksm.c b/mm/ksm.c
>> index 56808e3bfd19..23d6944f78ad 100644
>> --- a/mm/ksm.c
>> +++ b/mm/ksm.c
>> @@ -1063,6 +1063,7 @@ static int unmerge_and_remove_all_rmap_items(void)
>>
>>  			mm_slot_free(mm_slot_cache, mm_slot);
>>  			clear_bit(MMF_VM_MERGEABLE, &mm->flags);
>> +			clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
>>  			mmdrop(mm);
>>  		} else
>>  			spin_unlock(&ksm_mmlist_lock);
>> @@ -2329,6 +2330,17 @@ static struct ksm_rmap_item *get_next_rmap_item(struct ksm_mm_slot *mm_slot,
>>  	return rmap_item;
>>  }
>>
>> +static bool vma_ksm_mergeable(struct vm_area_struct *vma)
>> +{
>> +	if (vma->vm_flags & VM_MERGEABLE)
>> +		return true;
>> +
>> +	if (test_bit(MMF_VM_MERGE_ANY, &vma->vm_mm->flags))
>> +		return true;
>> +
>> +	return false;
>> +}
>> +
>>  static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
>>  {
>>  	struct mm_struct *mm;
>> @@ -2405,8 +2417,20 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
>>  			goto no_vmas;
>>
>>  		for_each_vma(vmi, vma) {
>> -			if (!(vma->vm_flags & VM_MERGEABLE))
>> +			if (!vma_ksm_mergeable(vma))
>>  				continue;
>> +			if (!(vma->vm_flags & VM_MERGEABLE)) {
>
> IMO, the helper obscures the interaction between the vma flag and the
> per-process flag here. How about:
>
> 	if (!(vma->vm_flags & VM_MERGEABLE)) {
> 		if (!test_bit(MMF_VM_MERGE_ANY, &vma->vm_mm->flags))
> 			continue;
>
> 		/*
> 		 * With per-process merging enabled, have the MM scan
> 		 * enroll any existing and new VMAs on the fly.
> 		 */
> 		ksm_madvise();
> 	}
>
>> +				unsigned long flags = vma->vm_flags;
>> +
>> +				/* madvise failed, use next vma */
>> +				if (ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_MERGEABLE, &flags))
>> +					continue;
>> +				/* vma, not supported as being mergeable */
>> +				if (!(flags & VM_MERGEABLE))
>> +					continue;
>> +
>> +				vm_flags_set(vma, VM_MERGEABLE);
>
> I don't understand the local flags. Can't it pass &vma->vm_flags to
> ksm_madvise()? It'll set VM_MERGEABLE on success. And you know it
> wasn't set before because the whole thing is inside the !set
> branch. The return value doesn't seem super useful, it's only the flag
> setting that matters:
>
> 	ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_MERGEABLE, &vma->vm_flags);
> 	/* madvise can fail, and will skip special vmas (pfnmaps and such) */
> 	if (!(vma->vm_flags & VM_MERGEABLE))
> 		continue;
>

vm_flags is declared const, so I cannot pass &vma->vm_flags to
ksm_madvise() directly. That is why I'm using a local variable for it.

>> +			}
>>  			if (ksm_scan.address < vma->vm_start)
>>  				ksm_scan.address = vma->vm_start;
>>  			if (!vma->anon_vma)
>> @@ -2491,6 +2515,7 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
>>
>>  		mm_slot_free(mm_slot_cache, mm_slot);
>>  		clear_bit(MMF_VM_MERGEABLE, &mm->flags);
>> +		clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
>>  		mmap_read_unlock(mm);
>>  		mmdrop(mm);
>>  	} else {
>
>> @@ -2664,12 +2690,39 @@ int __ksm_enter(struct mm_struct *mm)
>>  	return 0;
>>  }
>>
>> -void __ksm_exit(struct mm_struct *mm)
>> +static void unmerge_vmas(struct mm_struct *mm)
>> +{
>> +	struct vm_area_struct *vma;
>> +	struct vma_iterator vmi;
>> +
>> +	vma_iter_init(&vmi, mm, 0);
>> +
>> +	mmap_read_lock(mm);
>> +	for_each_vma(vmi, vma) {
>> +		if (vma->vm_flags & VM_MERGEABLE) {
>> +			unsigned long flags = vma->vm_flags;
>> +
>> +			if (ksm_madvise(vma, vma->vm_start, vma->vm_end, MADV_UNMERGEABLE, &flags))
>> +				continue;
>> +
>> +			vm_flags_clear(vma, VM_MERGEABLE);
>
> ksm_madvise() tests and clears VM_MERGEABLE, so AFAICS
>
> 	for_each_vma(vmi, vma)
> 		ksm_madvise();
>
> should do it...
>

This is the same problem: vma->vm_flags is declared const.

+		if (vma->vm_flags & VM_MERGEABLE) {

This will be removed.

>> +		}
>> +	}
>> +	mmap_read_unlock(mm);
>> +}
>> +
>> +void __ksm_exit(struct mm_struct *mm, int flag)
>>  {
>>  	struct ksm_mm_slot *mm_slot;
>>  	struct mm_slot *slot;
>>  	int easy_to_free = 0;
>>
>> +	if (!(current->flags & PF_EXITING) && flag == MMF_VM_MERGE_ANY &&
>> +		test_bit(MMF_VM_MERGE_ANY, &mm->flags)) {
>> +		clear_bit(MMF_VM_MERGE_ANY, &mm->flags);
>> +		unmerge_vmas(mm);
>
> ...and then it's short enough to just open-code it here and drop the
> unmerge_vmas() helper.
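
To illustrate the const vm_flags point raised in the replies above, here is a small userspace mock of the "local copy, then setter" pattern used in the patch. Everything in it is hypothetical and for illustration only: struct mock_vma, the MOCK_* flag values, and the mock_* functions are stand-ins, and the union layout is an assumption made for this sketch rather than a statement about how the kernel declares the field.

```c
#include <assert.h>
#include <stdbool.h>

/* Mock flag bits for this sketch only; not the kernel's values. */
#define MOCK_VM_PFNMAP		0x1UL
#define MOCK_VM_MERGEABLE	0x2UL

enum mock_advice { MOCK_MADV_MERGEABLE, MOCK_MADV_UNMERGEABLE };

/*
 * Mock vma: flags are read-only through vm_flags and writable only
 * through the aliasing member that the setter helpers use.
 */
struct mock_vma {
	union {
		const unsigned long vm_flags;
		unsigned long vm_flags_rw;
	};
};

static void mock_vm_flags_set(struct mock_vma *vma, unsigned long bits)
{
	vma->vm_flags_rw |= bits;
}

static void mock_vm_flags_clear(struct mock_vma *vma, unsigned long bits)
{
	vma->vm_flags_rw &= ~bits;
}

/* Stand-in for ksm_madvise(): only updates the caller's flags word. */
static int mock_ksm_madvise(enum mock_advice advice, unsigned long *vm_flags)
{
	assert(vm_flags);
	if (advice == MOCK_MADV_MERGEABLE) {
		if (*vm_flags & MOCK_VM_PFNMAP)
			return 0;	/* special mapping: silently skipped */
		*vm_flags |= MOCK_VM_MERGEABLE;
	} else {
		*vm_flags &= ~MOCK_VM_MERGEABLE;
	}
	return 0;
}

/* Enroll one vma; returns true if it ended up mergeable. */
static bool mock_enroll_vma(struct mock_vma *vma)
{
	unsigned long flags = vma->vm_flags;	/* local, writable copy */

	/* Passing &vma->vm_flags here would discard the const qualifier. */
	if (mock_ksm_madvise(MOCK_MADV_MERGEABLE, &flags))
		return false;
	if (!(flags & MOCK_VM_MERGEABLE))
		return false;

	mock_vm_flags_set(vma, MOCK_VM_MERGEABLE);
	return true;
}

/* Reverse of mock_enroll_vma(), mirroring the unmerge loop body. */
static void mock_unenroll_vma(struct mock_vma *vma)
{
	unsigned long flags = vma->vm_flags;

	if (mock_ksm_madvise(MOCK_MADV_UNMERGEABLE, &flags))
		return;
	mock_vm_flags_clear(vma, MOCK_VM_MERGEABLE);
}
```

The design trade-off is the one discussed in the thread: the helper works on a scratch copy, so the caller has to check the copy and then apply the result through the setter, which takes a few more lines than passing the field directly.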