From: Axel Rasmussen
Date: Wed, 17 May 2023 15:28:36 -0700
Subject: Re: [PATCH 1/3] mm: userfaultfd: add new UFFDIO_SIGBUS ioctl
To: Peter Xu
Cc: James Houghton, Alexander Viro, Andrew Morton, Christian Brauner,
    David Hildenbrand, Hongchen Zhang, Huang Ying, "Liam R. Howlett",
    Miaohe Lin, "Mike Rapoport (IBM)", Nadav Amit, Naoya Horiguchi,
    Shuah Khan, ZhangPeng, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org,
    linux-kselftest@vger.kernel.org, Anish Moorthy, Jiaqi Yan
References: <20230511182426.1898675-1-axelrasmussen@google.com>

On Wed, May 17, 2023 at
3:20 PM Peter Xu wrote:
>
> On Wed, May 17, 2023 at 06:12:33PM -0400, Peter Xu wrote:
> > On Thu, May 11, 2023 at 03:00:09PM -0700, James Houghton wrote:
> > > On Thu, May 11, 2023 at 11:24 AM Axel Rasmussen
> > > wrote:
> > > >
> > > > So the basic way to use this new feature is:
> > > >
> > > > - On the new host, the guest's memory is registered with userfaultfd, in
> > > >   either MISSING or MINOR mode (doesn't really matter for this purpose).
> > > > - On any first access, we get a userfaultfd event. At this point we can
> > > >   communicate with the old host to find out if the page was poisoned.
> > > > - If so, we can respond with a UFFDIO_SIGBUS - this places a swap marker
> > > >   so any future accesses will SIGBUS. Because the pte is now "present",
> > > >   future accesses won't generate more userfaultfd events, they'll just
> > > >   SIGBUS directly.
> > >
> > > I want to clarify the SIGBUS mechanism here when KVM is involved,
> > > keeping in mind that we need to be able to inject an MCE into the
> > > guest for this to be useful.
> > >
> > > 1. vCPU gets an EPT violation --> KVM attempts GUP.
> > > 2. GUP finds a PTE_MARKER_UFFD_SIGBUS and returns VM_FAULT_SIGBUS.
> > > 3. KVM finds that GUP failed and returns -EFAULT.
> > >
> > > This is different than if GUP found poison, in which case KVM will
> > > actually queue up a SIGBUS *containing the address of the fault*, and
> > > userspace can use it to inject an appropriate MCE into the guest. With
> > > UFFDIO_SIGBUS, we are missing the address!
> > >
> > > I see three options:
> > > 1. Make KVM_RUN queue up a signal for any VM_FAULT_SIGBUS. I think
> > > this is pointless.
> > > 2. Don't have UFFDIO_SIGBUS install a PTE entry, but instead have a
> > > UFFDIO_WAKE_MODE_SIGBUS, where upon waking, we return VM_FAULT_SIGBUS
> > > instead of VM_FAULT_RETRY. We will keep getting userfaults on repeated
> > > accesses, just like how we get repeated signals for real poison.
> > > 3. Use this in conjunction with the additional KVM EFAULT info that
> > > Anish proposed (the first part of [1]).
> > >
> > > I think option 3 is fine. :)
> >
> > Or... option 4) just to use either MADV_HWPOISON or hwpoison-inject? :)
>
> I just remember Axel mentioned this in the commit message, and just in case
> this is why option 4) was ruled out:
>
>         They expect that once poisoned, pages can never become
>         "un-poisoned". So, when we live migrate the VM, we need to preserve
>         the poisoned status of these pages.
>
> Just to supplement on this point: we do have unpoison (echoing to
> "debug/hwpoison/hwpoison_unpoison"), or am I wrong?
>
> >
> > Besides what James mentioned on "missing addr", I didn't quickly see what's
> > the major difference comparing to the old hwpoison injection methods even
> > without the addr requirement. If we want the addr for MCE then it's more of
> > a question to ask.
> >
> > I also didn't quickly see why for whatever new way to inject a pte error we
> > need to have it registered with uffd. Could it be something like
> > MADV_PGERR (even if MADV_HWPOISON won't suffice) so you can inject even
> > without a userfault context (but still usable when uffd registered)?
> >
> > And it'll be always nice to have a cover letter too (if there'll be a new
> > version) explaining the bits.

I do plan a v2, if for no other reason than to update the
documentation. Happy to add a cover letter with it as well.

+Jiaqi back to CC, this is one piece of a larger memory poisoning /
recovery design Jiaqi is working on, so he may have some ideas why
MADV_HWPOISON or MADV_PGERR will or won't work.

One idea is, at least for our use case, we have to have the range be
userfaultfd registered, because we need to intercept the first access
and check at that point whether or not it should be poisoned. But, I
think in principle a scheme like this could work:

1. Intercept first access with UFFD
2. Issue MADV_HWPOISON or MADV_PGERR or similar to put a pte denoting
   the poisoned page in place
3. UFFDIO_WAKE to have the faulting thread retry, see the new entry,
   and SIGBUS

It's arguably slightly weird, since normally UFFD events are resolved
with UFFDIO_* operations, but I don't see why it *couldn't* work.

Then again, I am not super familiar with MADV_HWPOISON; I will have to
do a bit of reading to understand if its semantics are the same
(future accesses to this address get SIGBUS).

> >
> > Thanks,
> >
> > --
> > Peter Xu
>
> --
> Peter Xu
>
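
To make the per-page semantics discussed in this thread concrete, here
is a toy model in plain C. It is only a sketch under stated
assumptions: it is not kernel code and calls no real userfaultfd or
madvise API. The names pte_state, touch_page, resolve_fault and
poisoned_page_walkthrough are hypothetical, invented for illustration.
It only captures the claim that once a marker is in place, later
accesses SIGBUS directly instead of raising another userfaultfd event.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a single page in a userfaultfd-registered range.
 * Every name below is hypothetical; none of this is kernel code. */
enum pte_state { PTE_NONE, PTE_MAPPED, PTE_SIGBUS_MARKER };
enum access_result { ACCESS_OK, ACCESS_UFFD_EVENT, ACCESS_SIGBUS };

/* What a thread touching the page observes for a given PTE state. */
enum access_result touch_page(enum pte_state pte)
{
    switch (pte) {
    case PTE_NONE:
        /* Nothing installed yet: the first access is intercepted. */
        return ACCESS_UFFD_EVENT;
    case PTE_SIGBUS_MARKER:
        /* Marker present: no new event, the access just gets SIGBUS. */
        return ACCESS_SIGBUS;
    case PTE_MAPPED:
    default:
        return ACCESS_OK;
    }
}

/* Userspace resolves the event after asking the old host about the page. */
enum pte_state resolve_fault(bool poisoned_on_old_host)
{
    return poisoned_on_old_host ? PTE_SIGBUS_MARKER : PTE_MAPPED;
}

/* Walk one poisoned page through the sequence described in the thread. */
void poisoned_page_walkthrough(void)
{
    enum pte_state pte = PTE_NONE;

    assert(touch_page(pte) == ACCESS_UFFD_EVENT); /* 1. intercept first access */
    pte = resolve_fault(true);                    /* 2. install the marker */
    assert(touch_page(pte) == ACCESS_SIGBUS);     /* 3. retry sees the marker */
    assert(touch_page(pte) == ACCESS_SIGBUS);     /* later accesses: same result */
}
```

In this model it does not matter whether the marker is installed by a
UFFDIO_SIGBUS-style ioctl or by an madvise-style call followed by
UFFDIO_WAKE; both end in the same state. What it does not capture is
the missing fault address James raised for the KVM case.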