Date: Wed, 18 Jan 2023 02:17:23 -0800
From: Isaku Yamahata
To: Chao Peng
Cc: Sean Christopherson, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-api@vger.kernel.org,
	linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Arnd Bergmann, Naoya Horiguchi, Miaohe Lin, x86@kernel.org,
	"H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
	Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
	"Maciej S . Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang,
	"Kirill A . Shutemov", luto@kernel.org, jun.nakajima@intel.com,
	dave.hansen@intel.com, ak@linux.intel.com, david@redhat.com,
	aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
	Quentin Perret, tabba@google.com, Michael Roth, mhocko@suse.com,
	wei.w.wang@intel.com, isaku.yamahata@gmail.com
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to create restricted user memory
Message-ID: <20230118101723.GA2976263@ls.amr.corp.intel.com>
References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com>
	<20221202061347.1070246-2-chao.p.peng@linux.intel.com>
	<20230117124107.GA273037@chaop.bj.intel.com>
	<20230118081641.GA303785@chaop.bj.intel.com>
In-Reply-To: <20230118081641.GA303785@chaop.bj.intel.com>

On Wed, Jan 18, 2023 at 04:16:41PM +0800, Chao Peng wrote:

> On Tue, Jan 17, 2023 at 04:34:15PM +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Chao Peng wrote:
> > > On Fri, Jan 13, 2023 at 09:54:41PM +0000, Sean Christopherson wrote:
> > > > > +	list_for_each_entry(notifier, &data->notifiers, list) {
> > > > > +		notifier->ops->invalidate_start(notifier, start, end);
> > > >
> > > > Two major design issues that we overlooked long ago:
> > > >
> > > >   1. Blindly invoking notifiers will not scale. E.g. if userspace configures a
> > > >      VM with a large number of convertible memslots that are all backed by a
> > > >      single large restrictedmem instance, then converting a single page will
> > > >      result in a linear walk through all memslots. I don't expect anyone to
> > > >      actually do something silly like that, but I also never expected there to be
> > > >      a legitimate usecase for thousands of memslots.
> > > >
> > > >   2. This approach fails to provide the ability for KVM to ensure a guest has
> > > >      exclusive access to a page. As discussed in the past, the kernel can rely
> > > >      on hardware (and maybe ARM's pKVM implementation?) for those guarantees, but
> > > >      only for SNP and TDX VMs. For VMs where userspace is trusted to some extent,
> > > >      e.g. SEV, there is value in ensuring a 1:1 association.
> > > >
> > > >      And probably more importantly, relying on hardware for SNP and TDX yields a
> > > >      poor ABI and complicates KVM's internals. If the kernel doesn't guarantee a
> > > >      page is exclusive to a guest, i.e. if userspace can hand out the same page
> > > >      from a restrictedmem instance to multiple VMs, then failure will occur only
> > > >      when KVM tries to assign the page to the second VM. That will happen deep
> > > >      in KVM, which means KVM needs to gracefully handle such errors, and it means
> > > >      that KVM's ABI effectively allows plumbing garbage into its memslots.
> > >
> > > It may not be a valid usage, but in my TDX environment I do meet below
> > > issue.
> > >
> > > kvm_set_user_memory AddrSpace#0 Slot#0 flags=0x4 gpa=0x0 size=0x80000000 ua=0x7fe1ebfff000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#1 flags=0x4 gpa=0xffc00000 size=0x400000 ua=0x7fe271579000 ret=0
> > > kvm_set_user_memory AddrSpace#0 Slot#2 flags=0x4 gpa=0xfeda0000 size=0x20000 ua=0x7fe1ec09f000 ret=-22
> > >
> > > Slot#2('SMRAM') is actually an alias into system memory(Slot#0) in QEMU
> > > and slot#2 fails due to below exclusive check.
> > >
> > > Currently I changed QEMU code to mark these alias slots as shared
> > > instead of private but I'm not 100% confident this is correct fix.
> >
> > That's a QEMU bug of sorts. SMM is mutually exclusive with TDX, QEMU shouldn't
> > be configuring SMRAM (or any SMM memslots for that matter) for TDX guests.
>
> Thanks for the confirmation. As long as we only bind one notifier for
> each address, using xarray does make things simple.

In the past, I had patches for QEMU to disable PAM and SMRAM, but they were
dropped for simplicity because SMRAM/PAM are disabled in their reset state,
with an unused memslot registered. The TDX guest BIOS (TDVF or EDK2) doesn't
enable them. Now we can revive them.

-- 
Isaku Yamahata