From: Suren Baghdasaryan <surenb@google.com>
Date: Thu, 25 Jan 2024 17:25:37 -0800
Subject: Re: [PATCH 1/2] userfaultfd: handle zeropage moves by UFFDIO_MOVE
To: akpm@linux-foundation.org
Cc: viro@zeniv.linux.org.uk, brauner@kernel.org, shuah@kernel.org,
	aarcange@redhat.com, lokeshgidra@google.com, peterx@redhat.com,
	david@redhat.com, ryan.roberts@arm.com, hughd@google.com,
	mhocko@suse.com, axelrasmussen@google.com, rppt@kernel.org,
	willy@infradead.org, Liam.Howlett@oracle.com, jannh@google.com,
	zhangpeng362@huawei.com, bgeffon@google.com, kaleshsingh@google.com,
	ngeoffray@google.com, jdduke@google.com, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, kernel-team@android.com
In-Reply-To: <20240125001328.335127-1-surenb@google.com>
References: <20240125001328.335127-1-surenb@google.com>
On Wed, Jan 24, 2024 at 4:13 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> The current implementation of UFFDIO_MOVE fails to move zeropages and
> returns EBUSY when it encounters one. We can handle them by mapping a
> zeropage at the destination and clearing the mapping at the source. This
> is done both for ordinary and for huge zeropages.

I made a mistake when formatting this patch: it says [PATCH 1/2], but it
should be the only patch in the set. So, please do not look for [2/2].
Sorry about the confusion.
Thanks,
Suren.

>
> Signed-off-by: Suren Baghdasaryan
> ---
> Applies cleanly over mm-unstable branch.
>
>  mm/huge_memory.c | 105 +++++++++++++++++++++++++++--------------------
>  mm/userfaultfd.c |  42 +++++++++++++++----
>  2 files changed, 96 insertions(+), 51 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index f40feb31b507..5dcc02c25e97 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2190,13 +2190,18 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>         }
>
>         src_page = pmd_page(src_pmdval);
> -       if (unlikely(!PageAnonExclusive(src_page))) {
> -               spin_unlock(src_ptl);
> -               return -EBUSY;
> -       }
>
> -       src_folio = page_folio(src_page);
> -       folio_get(src_folio);
> +       if (!is_huge_zero_pmd(src_pmdval)) {
> +               if (unlikely(!PageAnonExclusive(src_page))) {
> +                       spin_unlock(src_ptl);
> +                       return -EBUSY;
> +               }
> +
> +               src_folio = page_folio(src_page);
> +               folio_get(src_folio);
> +       } else
> +               src_folio = NULL;
> +
>         spin_unlock(src_ptl);
>
>         flush_cache_range(src_vma, src_addr, src_addr + HPAGE_PMD_SIZE);
> @@ -2204,19 +2209,22 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>                                 src_addr + HPAGE_PMD_SIZE);
>         mmu_notifier_invalidate_range_start(&range);
>
> -       folio_lock(src_folio);
> +       if (src_folio) {
> +               folio_lock(src_folio);
>
> -       /*
> -        * split_huge_page walks the anon_vma chain without the page
> -        * lock. Serialize against it with the anon_vma lock, the page
> -        * lock is not enough.
> -        */
> -       src_anon_vma = folio_get_anon_vma(src_folio);
> -       if (!src_anon_vma) {
> -               err = -EAGAIN;
> -               goto unlock_folio;
> -       }
> -       anon_vma_lock_write(src_anon_vma);
> +               /*
> +                * split_huge_page walks the anon_vma chain without the page
> +                * lock. Serialize against it with the anon_vma lock, the page
> +                * lock is not enough.
> +                */
> +               src_anon_vma = folio_get_anon_vma(src_folio);
> +               if (!src_anon_vma) {
> +                       err = -EAGAIN;
> +                       goto unlock_folio;
> +               }
> +               anon_vma_lock_write(src_anon_vma);
> +       } else
> +               src_anon_vma = NULL;
>
>         dst_ptl = pmd_lockptr(mm, dst_pmd);
>         double_pt_lock(src_ptl, dst_ptl);
> @@ -2225,45 +2233,54 @@ int move_pages_huge_pmd(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd, pm
>                 err = -EAGAIN;
>                 goto unlock_ptls;
>         }
> -       if (folio_maybe_dma_pinned(src_folio) ||
> -           !PageAnonExclusive(&src_folio->page)) {
> -               err = -EBUSY;
> -               goto unlock_ptls;
> -       }
> +       if (src_folio) {
> +               if (folio_maybe_dma_pinned(src_folio) ||
> +                   !PageAnonExclusive(&src_folio->page)) {
> +                       err = -EBUSY;
> +                       goto unlock_ptls;
> +               }
>
> -       if (WARN_ON_ONCE(!folio_test_head(src_folio)) ||
> -           WARN_ON_ONCE(!folio_test_anon(src_folio))) {
> -               err = -EBUSY;
> -               goto unlock_ptls;
> -       }
> +               if (WARN_ON_ONCE(!folio_test_head(src_folio)) ||
> +                   WARN_ON_ONCE(!folio_test_anon(src_folio))) {
> +                       err = -EBUSY;
> +                       goto unlock_ptls;
> +               }
>
> -       folio_move_anon_rmap(src_folio, dst_vma);
> -       WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
> +               folio_move_anon_rmap(src_folio, dst_vma);
> +               WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
>
> -       src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
> -       /* Folio got pinned from under us. Put it back and fail the move. */
> -       if (folio_maybe_dma_pinned(src_folio)) {
> -               set_pmd_at(mm, src_addr, src_pmd, src_pmdval);
> -               err = -EBUSY;
> -               goto unlock_ptls;
> -       }
> +               src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
> +               /* Folio got pinned from under us. Put it back and fail the move. */
> +               if (folio_maybe_dma_pinned(src_folio)) {
> +                       set_pmd_at(mm, src_addr, src_pmd, src_pmdval);
> +                       err = -EBUSY;
> +                       goto unlock_ptls;
> +               }
>
> -       _dst_pmd = mk_huge_pmd(&src_folio->page, dst_vma->vm_page_prot);
> -       /* Follow mremap() behavior and treat the entry dirty after the move */
> -       _dst_pmd = pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> +               _dst_pmd = mk_huge_pmd(&src_folio->page, dst_vma->vm_page_prot);
> +               /* Follow mremap() behavior and treat the entry dirty after the move */
> +               _dst_pmd = pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> +       } else {
> +               src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
> +               _dst_pmd = mk_huge_pmd(src_page, dst_vma->vm_page_prot);
> +       }
>         set_pmd_at(mm, dst_addr, dst_pmd, _dst_pmd);
>
>         src_pgtable = pgtable_trans_huge_withdraw(mm, src_pmd);
>         pgtable_trans_huge_deposit(mm, dst_pmd, src_pgtable);
> unlock_ptls:
>         double_pt_unlock(src_ptl, dst_ptl);
> -       anon_vma_unlock_write(src_anon_vma);
> -       put_anon_vma(src_anon_vma);
> +       if (src_anon_vma) {
> +               anon_vma_unlock_write(src_anon_vma);
> +               put_anon_vma(src_anon_vma);
> +       }
> unlock_folio:
>         /* unblock rmap walks */
> -       folio_unlock(src_folio);
> +       if (src_folio)
> +               folio_unlock(src_folio);
>         mmu_notifier_invalidate_range_end(&range);
> -       folio_put(src_folio);
> +       if (src_folio)
> +               folio_put(src_folio);
>         return err;
> }
> #endif /* CONFIG_USERFAULTFD */
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 3548b3e31a97..5fbf4da15c5c 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -959,6 +959,31 @@ static int move_swap_pte(struct mm_struct *mm,
>         return 0;
> }
>
> +static int move_zeropage_pte(struct mm_struct *mm,
> +                            struct vm_area_struct *dst_vma,
> +                            struct vm_area_struct *src_vma,
> +                            unsigned long dst_addr, unsigned long src_addr,
> +                            pte_t *dst_pte, pte_t *src_pte,
> +                            pte_t orig_dst_pte, pte_t orig_src_pte,
> +                            spinlock_t *dst_ptl, spinlock_t *src_ptl)
> +{
> +       pte_t zero_pte;
> +
> +       double_pt_lock(dst_ptl, src_ptl);
> +       if (!pte_same(ptep_get(src_pte), orig_src_pte) ||
> +           !pte_same(ptep_get(dst_pte), orig_dst_pte)) {
> +               double_pt_unlock(dst_ptl, src_ptl);
> +               return -EAGAIN;
> +       }
> +
> +       zero_pte = pte_mkspecial(pfn_pte(my_zero_pfn(dst_addr),
> +                                        dst_vma->vm_page_prot));
> +       ptep_clear_flush(src_vma, src_addr, src_pte);
> +       set_pte_at(mm, dst_addr, dst_pte, zero_pte);
> +       double_pt_unlock(dst_ptl, src_ptl);
> +
> +       return 0;
> +}
> +
> /*
>  * The mmap_lock for reading is held by the caller. Just move the page
>  * from src_pmd to dst_pmd if possible, and return true if succeeded
> @@ -1041,6 +1066,14 @@ static int move_pages_pte(struct mm_struct *mm, pmd_t *dst_pmd, pmd_t *src_pmd,
>         }
>
>         if (pte_present(orig_src_pte)) {
> +               if (is_zero_pfn(pte_pfn(orig_src_pte))) {
> +                       err = move_zeropage_pte(mm, dst_vma, src_vma,
> +                                               dst_addr, src_addr, dst_pte, src_pte,
> +                                               orig_dst_pte, orig_src_pte,
> +                                               dst_ptl, src_ptl);
> +                       goto out;
> +               }
> +
>                 /*
>                  * Pin and lock both source folio and anon_vma. Since we are in
>                  * RCU read section, we can't block, so on contention have to
> @@ -1404,19 +1437,14 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, struct mm_struct *mm,
>                         err = -ENOENT;
>                         break;
>                 }
> -               /* Avoid moving zeropages for now */
> -               if (is_huge_zero_pmd(*src_pmd)) {
> -                       spin_unlock(ptl);
> -                       err = -EBUSY;
> -                       break;
> -               }
>
>                 /* Check if we can move the pmd without splitting it. */
>                 if (move_splits_huge_pmd(dst_addr, src_addr, src_start + len) ||
>                     !pmd_none(dst_pmdval)) {
>                         struct folio *folio = pfn_folio(pmd_pfn(*src_pmd));
>
> -                       if (!folio || !PageAnonExclusive(&folio->page)) {
> +                       if (!folio || (!is_huge_zero_page(&folio->page) &&
> +                                      !PageAnonExclusive(&folio->page))) {
>                                 spin_unlock(ptl);
>                                 err = -EBUSY;
>                                 break;
> --
> 2.43.0.429.g432eaa2c6b-goog
>