From: Jiaqi Yan
Date: Wed, 17 May 2023 17:43:53 -0700
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl
To: Axel Rasmussen, Peter Xu, James Houghton
Cc: Alexander Viro, Andrew Morton, Christian Brauner, David Hildenbrand,
    Hongchen Zhang, Huang Ying, "Liam R. Howlett", Miaohe Lin,
    "Mike Rapoport (IBM)", Nadav Amit, Naoya Horiguchi, Shuah Khan, ZhangPeng,
    linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Anish Moorthy
References: <20230511182426.1898675-1-axelrasmussen@google.com>

On Wed, May 17, 2023 at 3:29 PM Axel Rasmussen wrote:
>
> On Wed, May 17, 2023 at 3:20 PM Peter Xu wrote:
> >
> > On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > > wrote:
> > > > >
> > > > > So the basic way to use this new feature is:
> > > > >
> > > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > > - On any first access, we get a userfaultfd event. At this point we can
> > > > >   communicate with the old host to find out if the page was poisoned.
> > > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > > >   future accesses won't generate more userfaultfd events, they'll just
> > > > >   SIGBUS directly.
> > > >
> > > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > > keeping in mind that we need to be able to inject an MCE into the
> > > > guest for this to be useful.
> > > >
> > > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > > 3. KVM finds that GUP failed and returns -EFAULT.
> > > >
> > > > This is different than if GUP found poison, in which case KVM will
> > > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > > userspace can use it to inject an appropriate MCE into the guest. With
> > > > UFFDIO_SIGBUS, we are missing the address!
> > > >
> > > > I see three options:
> > > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > > this is pointless.
> > > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > > accesses, just like how we get repeated signals for real poison.
> > > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > > Anish proposed (the first part of [1]).
> > > >
> > > > I think option 3 is fine. :)
> > >
> > > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
> >
> > I just remember Axel mentioned this in the commit message, and just in case
> > this is why option 4) was ruled out:
> >
> >         They expect that once poisoned, pages can never become
> >         "un-poisoned". So, when we live migrate the VM, we need to preserve
> >         the poisoned status of these pages.
> >
> > Just to supplement on this point: we do have unpoison (echoing to
> > "debug/hwpoison/hwpoison_unpoison"), or am I wrong?

If I read unpoison_memory() correctly, once there is a real hardware
memory corruption (hw_memory_failure will be set), unpoison will stop
working and return EOPNOTSUPP.

I know some cloud providers evacuate VMs as soon as a single memory
error happens, so not supporting unpoison is probably not a big deal
for them. BUT others do keep the VM running until more errors show up
later, which could be long after the first error.

> >
> > >
> > > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > > the major difference comparing to the old hwpoison injection methods even
> > > without the addr requirement. If we want the addr for MCE then it's more of
> > > a question to ask.
> > >
> > > I also didn't quickly see why for whatever new way to inject a pte error we
> > > need to have it registered with uffd.
> > > Could it be something like
> > > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > > without an userfault context (but still usable when uffd registered)?
> > >
> > > And it'll be always nice to have a cover letter too (if there'll be a new
> > > version) explaining the bits.
>
> I do plan a v2, if for no other reason than to update the
> documentation. Happy to add a cover letter with it as well.
>
> +Jiaqi back to CC, this is one piece of a larger memory poisoning /
> recovery design Jiaqi is working on, so he may have some ideas why
> MADV_HWPOISON or MADV_PGERR will or won't work.

Per https://man7.org/linux/man-pages/man2/madvise.2.html,
MADV_HWPOISON "is available only for privileged (CAP_SYS_ADMIN)
processes." So for a non-root VMM, MADV_HWPOISON is not an option.

Another issue with MADV_HWPOISON is that it first has to succeed in
get_user_pages_fast(). I don't think it will work if the memory is not
mapped yet. With the UFFDIO_SIGBUS feature introduced in this
patchset, it may even be possible to free the emulated-hwpoison page
back to the kernel so we don't lose a 4K page.

I didn't find any reference or documentation for MADV_PGERR. Is it
something you are suggesting we build, Peter?

>
> One idea is, at least for our use case, we have to have the range be
> userfaultfd registered, because we need to intercept the first access
> and check at that point whether or not it should be poisoned. But, I
> think in principle a scheme like this could work:
>
> 1. Intercept first access with UFFD
> 2. Issue MADV_HWPOISON or MADV_PGERR or etc to put a pte denoting the
> poisoned page in place
> 3. UFFDIO_WAKE to have the faulting thread retry, see the new entry, and SIGBUS
>
> It's arguably slightly weird, since normally UFFD events are resolved
> with UFFDIO_* operations, but I don't see why it *couldn't* work.
>
> Then again I am not super familiar with MADV_HWPOISON, I will have to
> do a bit of reading to understand if its semantics are the same
> (future accesses to this address get SIGBUS).
>
> > >
> > > Thanks,
> > >
> > > --
> > > Peter Xu
> >
> > --
> > Peter Xu
>
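
To make the intended semantics from Axel's description concrete, here is
a toy userspace model, not kernel code and not the real userfaultfd API.
It only mimics the behaviour discussed in this thread: the first access
to an unpopulated page raises a (simulated) userfaultfd event, the
handler asks whether the page was poisoned on the old host, and a
poisoned page gets a marker so later accesses SIGBUS without another
event. All names here (ToyGuestMemory, Pte, SigBus) are hypothetical and
exist only for this sketch.

```python
from enum import Enum, auto


class Pte(Enum):
    """Toy PTE states; SIGBUS_MARKER stands in for the swap marker."""
    NONE = auto()
    PRESENT = auto()
    SIGBUS_MARKER = auto()


class SigBus(Exception):
    """Stands in for the SIGBUS delivered to the accessing thread."""


class ToyGuestMemory:
    def __init__(self, num_pages, poisoned_on_old_host):
        self.ptes = [Pte.NONE] * num_pages
        # Assumption: the set of poisoned pages is known from the old host.
        self.poisoned_on_old_host = set(poisoned_on_old_host)
        # Record of simulated userfaultfd events, in arrival order.
        self.uffd_events = []

    def _handle_uffd_event(self, page):
        """Simulated VMM handler for a first-access fault."""
        self.uffd_events.append(page)
        if page in self.poisoned_on_old_host:
            # Models responding with UFFDIO_SIGBUS: install the marker.
            self.ptes[page] = Pte.SIGBUS_MARKER
        else:
            # Models resolving the fault normally (e.g. copying the page).
            self.ptes[page] = Pte.PRESENT

    def access(self, page):
        if self.ptes[page] is Pte.NONE:
            # Only an empty PTE generates a userfaultfd event.
            self._handle_uffd_event(page)
        if self.ptes[page] is Pte.SIGBUS_MARKER:
            # The marker is already in place, so no new event is raised.
            raise SigBus(page)
        return "ok"
```

In this model, touching a poisoned page twice raises SigBus both times
but records only one event, which is the property the swap marker is
meant to provide. It does not capture the KVM side James raised (the
missing fault address on -EFAULT).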