Date: Fri, 7 Jul 2023 14:55:36 -0700
In-Reply-To: <20230707215540.2324998-1-axelrasmussen@google.com>
References: <20230707215540.2324998-1-axelrasmussen@google.com>
Message-ID: <20230707215540.2324998-5-axelrasmussen@google.com>
Subject: [PATCH v4 4/8] mm: userfaultfd: add new UFFDIO_POISON ioctl
From: Axel Rasmussen
To: Alexander Viro, Andrew Morton, Brian Geffon, Christian Brauner,
	David Hildenbrand, Gaosheng Cui, Huang Ying, Hugh Dickins,
	James Houghton, "Jan Alexander Steffens (heftig)", Jiaqi Yan,
	Jonathan Corbet, Kefeng Wang, "Liam R. Howlett", Miaohe Lin,
	Mike Kravetz, "Mike Rapoport (IBM)", Muchun Song, Nadav Amit,
	Naoya Horiguchi, Peter Xu, Ryan Roberts, Shuah Khan,
	Suleiman Souhlal, Suren Baghdasaryan, "T.J. Alumbaugh", Yu Zhao,
	ZhangPeng
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kselftest@vger.kernel.org, Axel Rasmussen

The basic idea here is to "simulate" memory poisoning for VMs. A VM
running on some host might encounter a memory error, after which some
page(s) are poisoned (i.e., future accesses SIGBUS). They expect that
once poisoned, pages can never become "un-poisoned". So, when we live
migrate the VM, we need to preserve the poisoned status of these pages.

When live migrating, we try to get the guest running on its new host as
quickly as possible. So, we start it running before all memory has been
copied, and before we're certain which pages should be poisoned or not.

So the basic way to use this new feature is:

- On the new host, the guest's memory is registered with userfaultfd, in
  either MISSING or MINOR mode (doesn't really matter for this purpose).
- On any first access, we get a userfaultfd event. At this point we can
  communicate with the old host to find out if the page was poisoned.
- If so, we can respond with a UFFDIO_POISON - this places a swap marker
  so any future accesses will SIGBUS. Because the pte is now "present",
  future accesses won't generate more userfaultfd events, they'll just
  SIGBUS directly.

UFFDIO_POISON does not handle unmapping previously-present PTEs. This
isn't needed, because during live migration we want to intercept
all accesses with userfaultfd (not just writes, so WP mode isn't useful
for this). So whether minor or missing mode is being used (or both), the
PTE won't be present in any case, so handling that case isn't needed.

Similarly, UFFDIO_POISON won't replace existing PTE markers. This might
be okay to do, but it seems to be safer to just refuse to overwrite any
existing entry (like a UFFD_WP PTE marker).

Acked-by: Peter Xu
Signed-off-by: Axel Rasmussen
---
 fs/userfaultfd.c                 | 58 ++++++++++++++++++++++++++++++++
 include/linux/userfaultfd_k.h    |  4 +++
 include/uapi/linux/userfaultfd.h | 16 +++++++++
 mm/userfaultfd.c                 | 48 +++++++++++++++++++++++++-
 4 files changed, 125 insertions(+), 1 deletion(-)

diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 2e84684c46f0..53a7220c4679 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -1956,6 +1956,61 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
 	return ret;
 }
 
+static inline int userfaultfd_poison(struct userfaultfd_ctx *ctx, unsigned long arg)
+{
+	__s64 ret;
+	struct uffdio_poison uffdio_poison;
+	struct uffdio_poison __user *user_uffdio_poison;
+	struct userfaultfd_wake_range range;
+
+	user_uffdio_poison = (struct uffdio_poison __user *)arg;
+
+	ret = -EAGAIN;
+	if (atomic_read(&ctx->mmap_changing))
+		goto out;
+
+	ret = -EFAULT;
+	if (copy_from_user(&uffdio_poison, user_uffdio_poison,
+			   /* don't copy the output fields */
+			   sizeof(uffdio_poison) - (sizeof(__s64))))
+		goto out;
+
+	ret = validate_range(ctx->mm, uffdio_poison.range.start,
+			     uffdio_poison.range.len);
+	if (ret)
+		goto out;
+
+	ret = -EINVAL;
+	if (uffdio_poison.mode & ~UFFDIO_POISON_MODE_DONTWAKE)
+		goto out;
+
+	if (mmget_not_zero(ctx->mm)) {
+		ret = mfill_atomic_poison(ctx->mm, uffdio_poison.range.start,
+					  uffdio_poison.range.len,
+					  &ctx->mmap_changing, 0);
+		mmput(ctx->mm);
+	} else {
+		return -ESRCH;
+	}
+
+	if (unlikely(put_user(ret, &user_uffdio_poison->updated)))
+		return -EFAULT;
+	if (ret < 0)
+		goto out;
+
+	/* len == 0 would wake all */
+	BUG_ON(!ret);
+	range.len = ret;
+	if (!(uffdio_poison.mode & UFFDIO_POISON_MODE_DONTWAKE)) {
+		range.start = uffdio_poison.range.start;
+		wake_userfault(ctx, &range);
+	}
+	ret = range.len == uffdio_poison.range.len ? 0 : -EAGAIN;
+
+out:
+	return ret;
+}
+
 static inline unsigned int uffd_ctx_features(__u64 user_features)
 {
 	/*
@@ -2057,6 +2112,9 @@ static long userfaultfd_ioctl(struct file *file, unsigned cmd,
 	case UFFDIO_CONTINUE:
 		ret = userfaultfd_continue(ctx, arg);
 		break;
+	case UFFDIO_POISON:
+		ret = userfaultfd_poison(ctx, arg);
+		break;
 	}
 	return ret;
 }
diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
index ac7b0c96d351..ac8c6854097c 100644
--- a/include/linux/userfaultfd_k.h
+++ b/include/linux/userfaultfd_k.h
@@ -46,6 +46,7 @@ enum mfill_atomic_mode {
 	MFILL_ATOMIC_COPY,
 	MFILL_ATOMIC_ZEROPAGE,
 	MFILL_ATOMIC_CONTINUE,
+	MFILL_ATOMIC_POISON,
 	NR_MFILL_ATOMIC_MODES,
 };
 
@@ -83,6 +84,9 @@ extern ssize_t mfill_atomic_zeropage(struct mm_struct *dst_mm,
 extern ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long dst_start,
 				     unsigned long len, atomic_t *mmap_changing,
 				     uffd_flags_t flags);
+extern ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+				   unsigned long len, atomic_t *mmap_changing,
+				   uffd_flags_t flags);
 extern int mwriteprotect_range(struct mm_struct *dst_mm,
 			       unsigned long start, unsigned long len,
 			       bool enable_wp, atomic_t *mmap_changing);
diff --git a/include/uapi/linux/userfaultfd.h b/include/uapi/linux/userfaultfd.h
index 66dd4cd277bd..b5f07eacc697 100644
--- a/include/uapi/linux/userfaultfd.h
+++ b/include/uapi/linux/userfaultfd.h
@@ -71,6 +71,7 @@
 #define _UFFDIO_ZEROPAGE		(0x04)
 #define _UFFDIO_WRITEPROTECT		(0x06)
 #define _UFFDIO_CONTINUE		(0x07)
+#define _UFFDIO_POISON			(0x08)
 #define _UFFDIO_API			(0x3F)
 
 /* userfaultfd ioctl ids */
@@ -91,6 +92,8 @@
 				      struct uffdio_writeprotect)
 #define UFFDIO_CONTINUE		_IOWR(UFFDIO, _UFFDIO_CONTINUE,	\
 				      struct uffdio_continue)
+#define UFFDIO_POISON		_IOWR(UFFDIO, _UFFDIO_POISON, \
+				      struct uffdio_poison)
 
 /* read() structure */
 struct uffd_msg {
@@ -225,6 +228,7 @@ struct uffdio_api {
 #define UFFD_FEATURE_EXACT_ADDRESS		(1<<11)
 #define UFFD_FEATURE_WP_HUGETLBFS_SHMEM		(1<<12)
 #define UFFD_FEATURE_WP_UNPOPULATED		(1<<13)
+#define UFFD_FEATURE_POISON			(1<<14)
 	__u64 features;
 
 	__u64 ioctls;
@@ -321,6 +325,18 @@ struct uffdio_continue {
 	__s64 mapped;
 };
 
+struct uffdio_poison {
+	struct uffdio_range range;
+#define UFFDIO_POISON_MODE_DONTWAKE		((__u64)1<<0)
+	__u64 mode;
+
+	/*
+	 * Fields below here are written by the ioctl and must be at the end:
+	 * the copy_from_user will not read past here.
+	 */
+	__s64 updated;
+};
+
 /*
  * Flags for the userfaultfd(2) system call itself.
  */
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 4244ca7ee903..68157359dc34 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -288,6 +288,40 @@ static int mfill_atomic_pte_continue(pmd_t *dst_pmd,
 	goto out;
 }
 
+/* Handles UFFDIO_POISON for all non-hugetlb VMAs. */
+static int mfill_atomic_pte_poison(pmd_t *dst_pmd,
+				   struct vm_area_struct *dst_vma,
+				   unsigned long dst_addr,
+				   uffd_flags_t flags)
+{
+	int ret;
+	struct mm_struct *dst_mm = dst_vma->vm_mm;
+	pte_t _dst_pte, *dst_pte;
+	spinlock_t *ptl;
+
+	_dst_pte = make_pte_marker(PTE_MARKER_POISONED);
+	dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl);
+
+	if (mfill_file_over_size(dst_vma, dst_addr)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	ret = -EEXIST;
+	/* Refuse to overwrite any PTE, even a PTE marker (e.g. UFFD WP). */
+	if (!pte_none(*dst_pte))
+		goto out_unlock;
+
+	set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
+
+	/* No need to invalidate - it was non-present before */
+	update_mmu_cache(dst_vma, dst_addr, dst_pte);
+	ret = 0;
+out_unlock:
+	pte_unmap_unlock(dst_pte, ptl);
+	return ret;
+}
+
 static pmd_t *mm_alloc_pmd(struct mm_struct *mm, unsigned long address)
 {
 	pgd_t *pgd;
@@ -339,7 +373,8 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
 	 * by THP. Since we can not reliably insert a zero page, this
 	 * feature is not supported.
 	 */
-	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE)) {
+	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_ZEROPAGE) ||
+	    uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
 		mmap_read_unlock(dst_mm);
 		return -EINVAL;
 	}
@@ -483,6 +518,9 @@ static __always_inline ssize_t mfill_atomic_pte(pmd_t *dst_pmd,
 	if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE)) {
 		return mfill_atomic_pte_continue(dst_pmd, dst_vma,
 						 dst_addr, flags);
+	} else if (uffd_flags_mode_is(flags, MFILL_ATOMIC_POISON)) {
+		return mfill_atomic_pte_poison(dst_pmd, dst_vma,
+					       dst_addr, flags);
 	}
 
 	/*
@@ -704,6 +742,14 @@ ssize_t mfill_atomic_continue(struct mm_struct *dst_mm, unsigned long start,
 			    uffd_flags_set_mode(flags, MFILL_ATOMIC_CONTINUE));
 }
 
+ssize_t mfill_atomic_poison(struct mm_struct *dst_mm, unsigned long start,
+			    unsigned long len, atomic_t *mmap_changing,
+			    uffd_flags_t flags)
+{
+	return mfill_atomic(dst_mm, start, 0, len, mmap_changing,
+			    uffd_flags_set_mode(flags, MFILL_ATOMIC_POISON));
+}
+
 long uffd_wp_range(struct vm_area_struct *dst_vma,
 		   unsigned long start, unsigned long len, bool enable_wp)
 {
-- 
2.41.0.255.g8b1d071c50-goog