From: Jiaqi Yan
Date: Fri, 19 May 2023 08:04:09 -0700
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl
To: Peter Xu
Cc: Axel Rasmussen, David Hildenbrand, James Houghton, Alexander Viro,
    Andrew Morton, Christian Brauner, Hongchen Zhang, Huang Ying,
    "Liam R. Howlett", Miaohe Lin, "Mike Rapoport (IBM)", Nadav Amit,
    Naoya Horiguchi, Shuah Khan, ZhangPeng, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org, Anish Moorthy

On Fri, May 19, 2023 at 1:38 AM David Hildenbrand wrote:
>
> On 18.05.23 22:38, Axel Rasmussen wrote:
> > On Thu, May 18, 2023 at 9:05 AM Peter Xu wrote:
> >>
> >> On Wed, May 17, 2023 at 05:43:53PM -0700, Jiaqi Yan wrote:
> >>> On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen wrote:
> >>>>
> >>>> On Wed, May 17, 2023 at 3:20 PM Peter Xu wrote:
> >>>>>
> >>>>> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> >>>>>> On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> >>>>>>> On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen wrote:
> >>>>>>>>
> >>>>>>>> So the basic way to use this new feature is:
> >>>>>>>>
> >>>>>>>> - On the new host, the guest's memory is registered with userfaultfd, in
> >>>>>>>>   either MISSING or MINOR mode (doesn't really matter for this purpose).
> >>>>>>>> - On any first access, we get a userfaultfd event. At this point we can
> >>>>>>>>   communicate with the old host to find out if the page was poisoned.
> >>>>>>>> - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> >>>>>>>>   so any future accesses will SIGBUS. Because the pte is now "present",
> >>>>>>>>   future accesses won't generate more userfaultfd events, they'll just
> >>>>>>>>   SIGBUS directly.
> >>>>>>>
> >>>>>>> I want to clarify the SIGBUS mechanism here when KVM is involved,
> >>>>>>> keeping in mind that we need to be able to inject an MCE into the
> >>>>>>> guest for this to be useful.
> >>>>>>>
> >>>>>>> 1. vCPU gets an EPT violation --> KVM attempts GUP.
> >>>>>>> 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> >>>>>>> 3. KVM finds that GUP failed and returns -EFAULT.
> >>>>>>>
> >>>>>>> This is different than if GUP found poison, in which case KVM will
> >>>>>>> actually queue up a SIGBUS *containing the address of the fault*, and
> >>>>>>> userspace can use it to inject an appropriate MCE into the guest. With
> >>>>>>> UFFDIO_SIGBUS, we are missing the address!
> >>>>>>>
> >>>>>>> I see three options:
> >>>>>>> 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> >>>>>>> this is pointless.
> >>>>>>> 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> >>>>>>> UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> >>>>>>> instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> >>>>>>> accesses, just like how we get repeated signals for real poison.
> >>>>>>> 3. Use this in conjunction with the additional KVM EFAULT info that
> >>>>>>> Anish proposed (the first part of [1]).
> >>>>>>>
> >>>>>>> I think option 3 is fine. :)
> >>>>>>
> >>>>>> Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> >>>>>
> >>>>> I just remember Axel mentioned this in the commit message, and just in case
> >>>>> this is why option 4) was ruled out:
> >>>>>
> >>>>>     They expect that once poisoned, pages can never become
> >>>>>     "un-poisoned". So, when we live migrate the VM, we need to preserve
> >>>>>     the poisoned status of these pages.
> >>>>>
> >>>>> Just to supplement on this point: we do have unpoison (echoing to
> >>>>> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
> >>>
> >>> If I read unpoison_memory() correctly, once there is a real hardware
> >>> memory corruption (hw_memory_failure will be set), unpoison will stop
> >>> working and return EOPNOTSUPP.
> >>>
> >>> I know some cloud providers evacuating VMs once a single memory error
> >>> happens, so not supporting unpoison is probably not a big deal for
> >>> them. BUT others do keep VM running until more errors show up later,
> >>> which could be long after the 1st error.
> >>
> >> We're talking about postcopy migrating a VM has poisoned page on src,
> >> rather than on dst host, am I right? IOW, the dest hwpoison should be
> >> fake.

Yes, for this we are on the same page. The scenario I want to describe is...

> >>
> >> If so, then I would assume that's the case where all the pages on the dest
> >> host is still all good (so hw_memory_failure not yet set, or I doubt the

...that the target VM can get a hardware error at any time: before
precopy (if the cloud provider is not carefully monitoring machine
health), during precopy from src to target, during src blackout,
during postcopy, or after migration is done, while the VM keeps
running on the host. Both MADV_HWPOISON[1] and hwpoison-inject[2] are
subject to hw_memory_failure, so they just seem unreliable to me: if
the target runs into memory errors before or early in the migration,
we lose the unpoison feature in the kernel.

[1] https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8df314e3669b4786e008a/mm/madvise.c#L1130
[2] https://github.com/torvalds/linux/blob/2d1bcbc6cd703e64caf8df314e3669b4786e008a/mm/hwpoison-inject.c#L51

> >> judgement of being a migration target after all)?
> >>
> >> The other thing is even if dest host has hw poisoned page, I'm not sure
> >> whether hw_memory_failure is the only way to solve this.
> >>
> >> I saw that this is something got worked on before from Zhenwei, David used
> >> to have some reasoning on why it was suggested like using a global knob:
> >>
> >> https://lore.kernel.org/all/d7927214-e433-c26d-7a9c-a291ced81887@redhat.com/
> >>
> >> Two major issues here afaics:
> >>
> >>   - Zhenwei's approach only considered x86 hwpoison - it relies on kpte
> >>     having !present in entries but that's x86 specific rather than generic
> >>     to memory_failure.c.
> >>
> >>   - It is _assumed_ that hwpoison injection is for debugging only.
> >>
> >> I'm not sure whether you can fix 1) by some other ways, e.g., what if the
> >> host just remember all the hardware poisoned pfns (or remember
> >> soft-poisoned ones, but then here we need to be careful on removing them
> >> from the list when it's hwpoisoned for real)? It sounds like there's
> >> opportunity on providing a generic solution rather than relying on
> >> !pte_present().
> >>
> >> For 2) IMHO that's not a big issue, you can declare it'll be used in !debug
> >> but production systems so as to boost the feature importance with a real
> >> use case.
> >>
> >> So far I'd say it'll be great to leverage what it's already there in linux
> >> and make it as generic as possible. The only issue is probably
> >> CAP_ADMIN... not sure whether we can have some way to provide !ADMIN
> >> somehow, or you can simply work around this issue.

I don't think CAP_SYS_ADMIN is something we can work around: a VMM
must be a good citizen and avoid introducing any vulnerability to the
host or guest.

On the other hand, "Userfaults allow the implementation of on-demand
paging from userland and more generally they allow userland to take
control of various memory page faults, something otherwise only the
kernel code could do." [3]. I am not familiar with the UFFD internals,
but our use case seems to match what UFFD wants to provide: without
affecting the whole world, give a specific userspace process (without
CAP_SYS_ADMIN) the ability to handle page faults, and so indirectly
emulate a HWPOISON page. (In my mind I treat it as a SetHWPOISON(page)
+ TestHWPOISON(page) operation in the kernel's page-fault code.) So is
it fair to say that what Axel provided here is "provide !ADMIN
somehow"?

[3] https://docs.kernel.org/admin-guide/mm/userfaultfd.html

> >
> > As you mention below I think the key distinction is the scope - I
> > think MADV_HWPOISON affects the whole system, including other
> > processes.
> >
> > For our purposes, we really just want to "poison" this particular
> > virtual address (the HVA, from the VM's perspective), not even other
> > mappings of the same shared memory. I think that behavior is different
> > from MADV_HWPOISON, at least.
>
> MADV_HWPOISON really is the wrong interface to use. See "man madvise".
>
> We don't want to allow arbitrary users to hwpoison+offline absolutely
> healthy physical memory, which is what MADV_HWPOISON is all about.
>
> As you say, we want to turn an unpopulated (!present) virtual address to
> mimic like we had a MCE on a page that would have been previously mapped
> here: install a hwpoison marker without actually poisoning any present
> page. In fact, we'd even want to fail if there *is* something mapped.
>
> Sure, one could teach MADV_HWPOISON to allow unprivileged users to do
> that for !present PTE entries, and fail for unprivileged users if there
> is a present PTE entry. I'm not sure if that's the cleanest approach,
> though, and a new MADV as suggested in this thread would eventually be
> cleaner.
>
> --
> Thanks,
>
> David / dhildenb
>
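
To make the per-address scope discussed above concrete, here is a
minimal toy model of the flow Axel described (first-access event,
lookup on the old host, marker install, later accesses SIGBUS without
a new event). It is only a sketch: nothing in it calls the kernel, and
ToyGuestMemory, SigBus, handle_fault, and try_access are hypothetical
names invented for illustration. The comments only say which real or
proposed ioctl each step stands in for.

```python
# Toy model only: nothing here talks to the kernel, and every class,
# method, and field name below is hypothetical.
from dataclasses import dataclass, field


class SigBus(Exception):
    """Stands in for a SIGBUS delivered on access to a marked address."""

    def __init__(self, addr):
        super().__init__(f"SIGBUS at {addr:#x}")
        self.addr = addr


@dataclass
class ToyGuestMemory:
    poisoned_on_src: set                             # what the old host reports
    markers: set = field(default_factory=set)        # per-VA "SIGBUS" markers
    populated: set = field(default_factory=set)      # pages resolved normally
    uffd_events: list = field(default_factory=list)  # faults seen by the handler

    def handle_fault(self, addr):
        """Userspace handler: ask the source, then mark or populate."""
        self.uffd_events.append(addr)
        if addr in self.poisoned_on_src:
            self.markers.add(addr)    # role of the proposed UFFDIO_SIGBUS
        else:
            self.populated.add(addr)  # role of UFFDIO_COPY / UFFDIO_CONTINUE

    def access(self, addr):
        if addr in self.markers:
            raise SigBus(addr)        # marker hit: no new userfaultfd event
        if addr not in self.populated:
            self.handle_fault(addr)   # first touch: one userfaultfd event
            return self.access(addr)
        return "ok"


def try_access(mem, addr):
    """Return "ok", or a string naming the address that raised SigBus."""
    try:
        return mem.access(addr)
    except SigBus as sig:
        return f"SIGBUS@{sig.addr:#x}"


mem = ToyGuestMemory(poisoned_on_src={0x2000})
first = try_access(mem, 0x1000)   # healthy page
second = try_access(mem, 0x2000)  # page that was poisoned on the source
third = try_access(mem, 0x2000)   # repeated access to the marked address
```

The markers live in one ToyGuestMemory instance, which mirrors the
"only this virtual address, only this process" scope and involves no
global state such as hw_memory_failure. The model does not cover the
KVM path James raised, where GUP failure surfaces as -EFAULT without
the faulting address.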