Message-Id: <6f44ddf9-6755-4120-be8b-7a62f0abc0e0@www.fastmail.com>
In-Reply-To: <20220412143636.GG64706@ziepe.ca>
References: <20220310140911.50924-1-chao.p.peng@linux.intel.com>
 <20220310140911.50924-5-chao.p.peng@linux.intel.com>
 <02e18c90-196e-409e-b2ac-822aceea8891@www.fastmail.com>
 <7ab689e7-e04d-5693-f899-d2d785b09892@redhat.com>
 <20220412143636.GG64706@ziepe.ca>
Date: Tue, 12 Apr 2022 14:27:52 -0700
From: "Andy Lutomirski"
To: "Jason Gunthorpe", "David Hildenbrand"
Cc: "Sean Christopherson", "Chao Peng", "kvm list",
 "Linux Kernel Mailing List", linux-mm@kvack.org,
 linux-fsdevel@vger.kernel.org, "Linux API", qemu-devel@nongnu.org,
 "Paolo Bonzini", "Jonathan Corbet", "Vitaly Kuznetsov", "Wanpeng Li",
 "Jim Mattson", "Joerg Roedel", "Thomas Gleixner", "Ingo Molnar",
 "Borislav Petkov", "the arch/x86 maintainers", "H. Peter Anvin",
 "Hugh Dickins", "Jeff Layton", "J . Bruce Fields", "Andrew Morton",
 "Mike Rapoport", "Steven Price", "Maciej S . Szmigiero",
 "Vlastimil Babka", "Vishal Annapurve", "Yu Zhang",
 "Kirill A. Shutemov", "Nakajima, Jun", "Dave Hansen", "Andi Kleen",
 "Eric W. Biederman"
Subject: Re: [PATCH v5 04/13] mm/shmem: Restrict MFD_INACCESSIBLE memory against RLIMIT_MEMLOCK

On Tue, Apr 12, 2022, at 7:36 AM, Jason Gunthorpe wrote:
> On Fri, Apr 08, 2022 at 08:54:02PM +0200, David Hildenbrand wrote:
>
>> RLIMIT_MEMLOCK was the obvious candidate, but as we discovered in the
>> past already with secretmem, it's not 100% that good of a fit (unmovable
>> is worse than mlocked). But it gets the job done for now at least.
>
> No, it doesn't. There are too many different interpretations how
> MEMLOCK is supposed to work
>
> eg VFIO accounts per-process so hostile users can just fork to go past
> it.
>
> RDMA is per-process but uses a different counter, so you can double up
>
> iouring is per-user and uses a 3rd counter, so it can triple up on
> the above two
>
>> So I'm open to alternatives to limit the amount of unmovable memory we
>> might allocate for user space, and then we could convert secretmem as well.
>
> I think it has to be cgroup based considering where we are now :\
>

So this is another situation where the actual backend (TDX, SEV, pKVM, pure software) makes a difference -- depending on exactly what backend we're using, the memory may not be unmoveable.  It might even be swappable (in the potentially distant future).

Anyway, here's a concrete proposal, with a bit of handwaving:

We add new cgroup limits:

  memory.unmoveable
  memory.locked

These can be set to an actual number or they can be set to the special value ROOT_CAP.  If they're set to ROOT_CAP, then anyone in the cgroup with capable(CAP_SYS_RESOURCE) (i.e. the global capability) can allocate unmoveable or locked memory with this (and potentially other) new APIs.  If it's 0, then they can't.  If it's another value, then the memory can be allocated and charged to the cgroup, up to the limit, with no particular capability needed.  The default at boot is ROOT_CAP.  Anyone who wants to configure it differently is free to do so.  This avoids introducing a DoS, makes it easy to run tests without configuring cgroups, and lets serious users set up their cgroups.

Nothing is charged per mm.

To make this fully sensible, we need to know what the backend is for the private memory before allocating any, so that we can charge it accordingly.
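
To make the three cases concrete, here is a rough userspace sketch of the charge decision described above.  None of this is existing kernel code: struct mem_limit, LIMIT_ROOT_CAP, and mem_limit_try_charge() are made-up names, has_cap stands in for capable(CAP_SYS_RESOURCE), and the choice not to account anything in ROOT_CAP mode is an assumption, since the proposal doesn't say either way.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding of the special value ROOT_CAP. */
#define LIMIT_ROOT_CAP UINT64_MAX

/*
 * Hypothetical per-cgroup state for one of the proposed limits
 * (memory.unmoveable or memory.locked).
 */
struct mem_limit {
	uint64_t limit;   /* LIMIT_ROOT_CAP, 0, or a byte count */
	uint64_t charged; /* bytes currently charged to the cgroup */
};

/*
 * Decide whether an allocation of `bytes` is allowed and, for numeric
 * limits, charge it to the cgroup.  `has_cap` stands in for
 * capable(CAP_SYS_RESOURCE).
 */
bool mem_limit_try_charge(struct mem_limit *ml, uint64_t bytes, bool has_cap)
{
	assert(ml);

	/* ROOT_CAP: gated purely on the global capability.
	 * Assumption: nothing is accounted in this mode. */
	if (ml->limit == LIMIT_ROOT_CAP)
		return has_cap;

	/* 0: nobody in the cgroup may allocate, capability or not. */
	if (ml->limit == 0)
		return false;

	/* Numeric limit: no capability needed, but stay within the limit.
	 * The subtraction form avoids overflow in charged + bytes. */
	if (ml->charged > ml->limit || bytes > ml->limit - ml->charged)
		return false;

	ml->charged += bytes;
	return true;
}
```

The charge is kept in the cgroup rather than in any per-process or per-user counter, which is the point of the proposal: forking, or mixing subsystems that each keep their own counter, doesn't buy extra headroom.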