From: Suren Baghdasaryan <surenb@google.com>
Date: Wed, 27 Sep 2023 11:07:48 -0700
Subject: Re: [PATCH v2 2/3] userfaultfd: UFFDIO_REMAP uABI
To: Jann Horn
Cc: akpm@linux-foundation.org, viro@zeniv.linux.org.uk, brauner@kernel.org,
    shuah@kernel.org, aarcange@redhat.com, lokeshgidra@google.com,
    peterx@redhat.com, david@redhat.com, hughd@google.com, mhocko@suse.com,
    axelrasmussen@google.com, rppt@kernel.org, willy@infradead.org,
    Liam.Howlett@oracle.com, zhangpeng362@huawei.com, bgeffon@google.com,
    kaleshsingh@google.com, ngeoffray@google.com, jdduke@google.com,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
    linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
    kernel-team@android.com
References: <20230923013148.1390521-1-surenb@google.com>
    <20230923013148.1390521-3-surenb@google.com>
On Wed, Sep 27, 2023 at 5:47 AM Jann Horn wrote:
>
> On Sat, Sep 23, 2023 at 3:31 AM Suren Baghdasaryan wrote:
> > From: Andrea Arcangeli
> >
> > This implements the uABI of UFFDIO_REMAP.
> >
> > Notably one mode bitflag is also forwarded (and in turn known) by the
> > lowlevel remap_pages method.
> >
> > Signed-off-by: Andrea Arcangeli
> > Signed-off-by: Suren Baghdasaryan
> [...]
> > +int remap_pages_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > +                         pmd_t *dst_pmd, pmd_t *src_pmd, pmd_t dst_pmdval,
> > +                         struct vm_area_struct *dst_vma,
> > +                         struct vm_area_struct *src_vma,
> > +                         unsigned long dst_addr, unsigned long src_addr)
> > +{
> > +        pmd_t _dst_pmd, src_pmdval;
> > +        struct page *src_page;
> > +        struct folio *src_folio;
> > +        struct anon_vma *src_anon_vma, *dst_anon_vma;
> > +        spinlock_t *src_ptl, *dst_ptl;
> > +        pgtable_t src_pgtable, dst_pgtable;
> > +        struct mmu_notifier_range range;
> > +        int err = 0;
> > +
> > +        src_pmdval = *src_pmd;
> > +        src_ptl = pmd_lockptr(src_mm, src_pmd);
> > +
> > +        BUG_ON(!spin_is_locked(src_ptl));
> > +        mmap_assert_locked(src_mm);
> > +        mmap_assert_locked(dst_mm);
> > +
> > +        BUG_ON(!pmd_trans_huge(src_pmdval));
> > +        BUG_ON(!pmd_none(dst_pmdval));
> > +        BUG_ON(src_addr & ~HPAGE_PMD_MASK);
> > +        BUG_ON(dst_addr & ~HPAGE_PMD_MASK);
> > +
> > +        src_page = pmd_page(src_pmdval);
> > +        if (unlikely(!PageAnonExclusive(src_page))) {
> > +                spin_unlock(src_ptl);
> > +                return -EBUSY;
> > +        }
> > +
> > +        src_folio = page_folio(src_page);
> > +        folio_get(src_folio);
> > +        spin_unlock(src_ptl);
> > +
> > +        /* preallocate dst_pgtable if needed */
> > +        if (dst_mm != src_mm) {
> > +                dst_pgtable = pte_alloc_one(dst_mm);
> > +                if (unlikely(!dst_pgtable)) {
> > +                        err = -ENOMEM;
> > +                        goto put_folio;
> > +                }
> > +        } else {
> > +                dst_pgtable = NULL;
> > +        }
> > +
> > +        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm, src_addr,
> > +                                src_addr + HPAGE_PMD_SIZE);
> > +        mmu_notifier_invalidate_range_start(&range);
> > +
> > +        /* block all concurrent rmap walks */
> > +        folio_lock(src_folio);
> > +
> > +        /*
> > +         * split_huge_page walks the anon_vma chain without the page
> > +         * lock. Serialize against it with the anon_vma lock, the page
> > +         * lock is not enough.
> > +         */
> > +        src_anon_vma = folio_get_anon_vma(src_folio);
> > +        if (!src_anon_vma) {
> > +                err = -EAGAIN;
> > +                goto unlock_folio;
> > +        }
> > +        anon_vma_lock_write(src_anon_vma);
> > +
> > +        dst_ptl = pmd_lockptr(dst_mm, dst_pmd);
> > +        double_pt_lock(src_ptl, dst_ptl);
> > +        if (unlikely(!pmd_same(*src_pmd, src_pmdval) ||
> > +                     !pmd_same(*dst_pmd, dst_pmdval) ||
> > +                     folio_mapcount(src_folio) != 1)) {
>
> I think this is also supposed to be PageAnonExclusive()?

Yes. Will fix.

>
> > +                double_pt_unlock(src_ptl, dst_ptl);
> > +                err = -EAGAIN;
> > +                goto put_anon_vma;
> > +        }
> > +
> > +        BUG_ON(!folio_test_head(src_folio));
> > +        BUG_ON(!folio_test_anon(src_folio));
> > +
> > +        dst_anon_vma = (void *)dst_vma->anon_vma + PAGE_MAPPING_ANON;
> > +        WRITE_ONCE(src_folio->mapping, (struct address_space *) dst_anon_vma);
> > +        WRITE_ONCE(src_folio->index, linear_page_index(dst_vma, dst_addr));
> > +
> > +        src_pmdval = pmdp_huge_clear_flush(src_vma, src_addr, src_pmd);
> > +        _dst_pmd = mk_huge_pmd(&src_folio->page, dst_vma->vm_page_prot);
> > +        _dst_pmd = maybe_pmd_mkwrite(pmd_mkdirty(_dst_pmd), dst_vma);
> > +        set_pmd_at(dst_mm, dst_addr, dst_pmd, _dst_pmd);
> > +
> > +        src_pgtable = pgtable_trans_huge_withdraw(src_mm, src_pmd);
> > +        if (dst_pgtable) {
> > +                pgtable_trans_huge_deposit(dst_mm, dst_pmd, dst_pgtable);
> > +                pte_free(src_mm, src_pgtable);
> > +                dst_pgtable = NULL;
> > +
> > +                mm_inc_nr_ptes(dst_mm);
> > +                mm_dec_nr_ptes(src_mm);
> > +                add_mm_counter(dst_mm, MM_ANONPAGES, HPAGE_PMD_NR);
> > +                add_mm_counter(src_mm, MM_ANONPAGES, -HPAGE_PMD_NR);
> > +        } else {
> > +                pgtable_trans_huge_deposit(dst_mm, dst_pmd, src_pgtable);
> > +        }
> > +        double_pt_unlock(src_ptl, dst_ptl);
> > +
> > +put_anon_vma:
> > +        anon_vma_unlock_write(src_anon_vma);
> > +        put_anon_vma(src_anon_vma);
> > +unlock_folio:
> > +        /* unblock rmap walks */
> > +        folio_unlock(src_folio);
> > +        mmu_notifier_invalidate_range_end(&range);
> > +        if (dst_pgtable)
> > +                pte_free(dst_mm, dst_pgtable);
> > +put_folio:
> > +        folio_put(src_folio);
> > +
> > +        return err;
> > +}
> > +#endif /* CONFIG_USERFAULTFD */
> [...]
> > +static int remap_anon_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > +                          struct vm_area_struct *dst_vma,
> > +                          struct vm_area_struct *src_vma,
> > +                          unsigned long dst_addr, unsigned long src_addr,
> > +                          pte_t *dst_pte, pte_t *src_pte,
> > +                          pte_t orig_dst_pte, pte_t orig_src_pte,
> > +                          spinlock_t *dst_ptl, spinlock_t *src_ptl,
> > +                          struct folio *src_folio)
> > +{
> > +        struct anon_vma *dst_anon_vma;
> > +
> > +        double_pt_lock(dst_ptl, src_ptl);
> > +
> > +        if (!pte_same(*src_pte, orig_src_pte) ||
> > +            !pte_same(*dst_pte, orig_dst_pte) ||
> > +            folio_test_large(src_folio) ||
> > +            folio_estimated_sharers(src_folio) != 1) {
> > +                double_pt_unlock(dst_ptl, src_ptl);
> > +                return -EAGAIN;
> > +        }
> > +
> > +        BUG_ON(!folio_test_anon(src_folio));
> > +
> > +        dst_anon_vma = (void *)dst_vma->anon_vma + PAGE_MAPPING_ANON;
> > +        WRITE_ONCE(src_folio->mapping,
> > +                   (struct address_space *) dst_anon_vma);
> > +        WRITE_ONCE(src_folio->index, linear_page_index(dst_vma,
> > +                                                       dst_addr));
> > +
> > +        orig_src_pte = ptep_clear_flush(src_vma, src_addr, src_pte);
> > +        orig_dst_pte = mk_pte(&src_folio->page, dst_vma->vm_page_prot);
> > +        orig_dst_pte = maybe_mkwrite(pte_mkdirty(orig_dst_pte),
> > +                                     dst_vma);
>
> I think there's still a theoretical issue here that you could fix by
> checking for the AnonExclusive flag, similar to the huge page case.
>
> Consider the following scenario:
>
> 1. process P1 does a write fault in a private anonymous VMA, creating
> and mapping a new anonymous page A1
> 2. process P1 forks and creates two children P2 and P3. afterwards, A1
> is mapped in P1, P2 and P3 as a COW page, with mapcount 3.
> 3. process P1 removes its mapping of A1, dropping its mapcount to 2.
> 4. process P2 uses vmsplice() to grab a reference to A1 with get_user_pages()
> 5.
> process P2 removes its mapping of A1, dropping its mapcount to 1.
>
> If at this point P3 does a write fault on its mapping of A1, it will
> still trigger copy-on-write thanks to the AnonExclusive mechanism; and
> this is necessary to avoid P3 mapping A1 as writable and writing data
> into it that will become visible to P2, if P2 and P3 are in different
> security contexts.
>
> But if P3 instead moves its mapping of A1 to another address with
> remap_anon_pte() which only does a page mapcount check, the
> maybe_mkwrite() will directly make the mapping writable, circumventing
> the AnonExclusive mechanism.

I see. Thanks for the detailed explanation! I will add PageAnonExclusive()
check in this path to prevent this scenario.

>
> > +        set_pte_at(dst_mm, dst_addr, dst_pte, orig_dst_pte);
> > +
> > +        if (dst_mm != src_mm) {
> > +                inc_mm_counter(dst_mm, MM_ANONPAGES);
> > +                dec_mm_counter(src_mm, MM_ANONPAGES);
> > +        }
> > +
> > +        double_pt_unlock(dst_ptl, src_ptl);
> > +
> > +        return 0;
> > +}
> > +
> > +static int remap_swap_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > +                          unsigned long dst_addr, unsigned long src_addr,
> > +                          pte_t *dst_pte, pte_t *src_pte,
> > +                          pte_t orig_dst_pte, pte_t orig_src_pte,
> > +                          spinlock_t *dst_ptl, spinlock_t *src_ptl)
> > +{
> > +        if (!pte_swp_exclusive(orig_src_pte))
> > +                return -EBUSY;
> > +
> > +        double_pt_lock(dst_ptl, src_ptl);
> > +
> > +        if (!pte_same(*src_pte, orig_src_pte) ||
> > +            !pte_same(*dst_pte, orig_dst_pte)) {
> > +                double_pt_unlock(dst_ptl, src_ptl);
> > +                return -EAGAIN;
> > +        }
> > +
> > +        orig_src_pte = ptep_get_and_clear(src_mm, src_addr, src_pte);
> > +        set_pte_at(dst_mm, dst_addr, dst_pte, orig_src_pte);
> > +
> > +        if (dst_mm != src_mm) {
> > +                inc_mm_counter(dst_mm, MM_ANONPAGES);
> > +                dec_mm_counter(src_mm, MM_ANONPAGES);
>
> I think this is the wrong counter. Looking at zap_pte_range(), in the
> "Genuine swap entry" case, we modify the MM_SWAPENTS counter, not
> MM_ANONPAGES.
Oops, my bad. Will fix.

>
> > +        }
> > +
> > +        double_pt_unlock(dst_ptl, src_ptl);
> > +
> > +        return 0;
> > +}
> > +
> > +/*
> > + * The mmap_lock for reading is held by the caller. Just move the page
> > + * from src_pmd to dst_pmd if possible, and return true if succeeded
> > + * in moving the page.
> > + */
> > +static int remap_pages_pte(struct mm_struct *dst_mm,
> > +                           struct mm_struct *src_mm,
> > +                           pmd_t *dst_pmd,
> > +                           pmd_t *src_pmd,
> > +                           struct vm_area_struct *dst_vma,
> > +                           struct vm_area_struct *src_vma,
> > +                           unsigned long dst_addr,
> > +                           unsigned long src_addr,
> > +                           __u64 mode)
> > +{
> > +        swp_entry_t entry;
> > +        pte_t orig_src_pte, orig_dst_pte;
> > +        spinlock_t *src_ptl, *dst_ptl;
> > +        pte_t *src_pte = NULL;
> > +        pte_t *dst_pte = NULL;
> > +
> > +        struct folio *src_folio = NULL;
> > +        struct anon_vma *src_anon_vma = NULL;
> > +        struct mmu_notifier_range range;
> > +        int err = 0;
> > +
> > +        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm,
> > +                                src_addr, src_addr + PAGE_SIZE);
> > +        mmu_notifier_invalidate_range_start(&range);
> > +retry:
>
> This retry looks a bit dodgy. On this retry label, we restart almost
> the entire operation, including re-reading the source PTE; the only
> variables that carry state forward from the previous retry loop
> iteration are src_folio and src_anon_vma.
>
> > +        dst_pte = pte_offset_map_nolock(dst_mm, dst_pmd, dst_addr, &dst_ptl);
> > +
> > +        /* If an huge pmd materialized from under us fail */
> > +        if (unlikely(!dst_pte)) {
> > +                err = -EFAULT;
> > +                goto out;
> > +        }
> [...]
> > +        spin_lock(dst_ptl);
> > +        orig_dst_pte = *dst_pte;
> > +        spin_unlock(dst_ptl);
> > +        if (!pte_none(orig_dst_pte)) {
> > +                err = -EEXIST;
> > +                goto out;
> > +        }
> > +
> > +        spin_lock(src_ptl);
> > +        orig_src_pte = *src_pte;
>
> Here we read an entirely new orig_src_pte value.
> Something like a concurrent MADV_DONTNEED+pagefault could have made
> the PTE point to a different page between loop iterations.
>
> > +        spin_unlock(src_ptl);
>
> I think you have to insert something like the following here:
>
> if (src_folio && (orig_src_pte != previous_src_pte)) {
>         err = -EAGAIN;
>         goto out;
> }
> previous_src_pte = orig_src_pte;

Yes, this definitely needs to be rechecked. Will fix.

>
> Otherwise:
>
> > +        if (pte_none(orig_src_pte)) {
> > +                if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES))
> > +                        err = -ENOENT;
> > +                else /* nothing to do to remap a hole */
> > +                        err = 0;
> > +                goto out;
> > +        }
> > +
> > +        if (pte_present(orig_src_pte)) {
> > +                /*
> > +                 * Pin and lock both source folio and anon_vma. Since we are in
> > +                 * RCU read section, we can't block, so on contention have to
> > +                 * unmap the ptes, obtain the lock and retry.
> > +                 */
> > +                if (!src_folio) {
>
> If we already found a src_folio in the previous iteration (but the
> trylock failed), we keep using the same src_folio without rechecking
> if the current PTE still points to the same folio.
> > +                        struct folio *folio;
> > +
> > +                        /*
> > +                         * Pin the page while holding the lock to be sure the
> > +                         * page isn't freed under us
> > +                         */
> > +                        spin_lock(src_ptl);
> > +                        if (!pte_same(orig_src_pte, *src_pte)) {
> > +                                spin_unlock(src_ptl);
> > +                                err = -EAGAIN;
> > +                                goto out;
> > +                        }
> > +
> > +                        folio = vm_normal_folio(src_vma, src_addr, orig_src_pte);
> > +                        if (!folio || !folio_test_anon(folio) ||
> > +                            folio_test_large(folio) ||
> > +                            folio_estimated_sharers(folio) != 1) {
> > +                                spin_unlock(src_ptl);
> > +                                err = -EBUSY;
> > +                                goto out;
> > +                        }
> > +
> > +                        folio_get(folio);
> > +                        src_folio = folio;
> > +                        spin_unlock(src_ptl);
> > +
> > +                        /* block all concurrent rmap walks */
> > +                        if (!folio_trylock(src_folio)) {
> > +                                pte_unmap(&orig_src_pte);
> > +                                pte_unmap(&orig_dst_pte);
> > +                                src_pte = dst_pte = NULL;
> > +                                /* now we can block and wait */
> > +                                folio_lock(src_folio);
> > +                                goto retry;
> > +                        }
> > +                }
> > +
> > +                if (!src_anon_vma) {
>
> (And here, if we already saw a src_anon_vma but the trylock failed,
> we'll keep using that src_anon_vma.)

Ack. The check for previous_src_pte should handle that as well.

>
> > +                        /*
> > +                         * folio_referenced walks the anon_vma chain
> > +                         * without the folio lock. Serialize against it with
> > +                         * the anon_vma lock, the folio lock is not enough.
> > +                         */
> > +                        src_anon_vma = folio_get_anon_vma(src_folio);
> > +                        if (!src_anon_vma) {
> > +                                /* page was unmapped from under us */
> > +                                err = -EAGAIN;
> > +                                goto out;
> > +                        }
> > +                        if (!anon_vma_trylock_write(src_anon_vma)) {
> > +                                pte_unmap(&orig_src_pte);
> > +                                pte_unmap(&orig_dst_pte);
> > +                                src_pte = dst_pte = NULL;
> > +                                /* now we can block and wait */
> > +                                anon_vma_lock_write(src_anon_vma);
> > +                                goto retry;
> > +                        }
> > +                }
>
> So at this point we have:
>
> - the current src_pte
> - some referenced+locked src_folio that used to be mapped exclusively
> at src_addr
> - (the anon_vma associated with the src_folio)
>
> > +                err = remap_anon_pte(dst_mm, src_mm, dst_vma, src_vma,
> > +                                     dst_addr, src_addr, dst_pte, src_pte,
> > +                                     orig_dst_pte, orig_src_pte,
> > +                                     dst_ptl, src_ptl, src_folio);
>
> And then this will, without touching folio mapcounts/refcounts, delete
> the current PTE at src_addr, and create a PTE at dst_addr pointing to
> the old src_folio, leading to incorrect refcounts/mapcounts?

I assume this still points to the missing previous_src_pte check
discussed in the previous comments. Is that correct or is there yet
another issue?

>
> > +        } else {
> [...]
> > +        }
> > +
> > +out:
> > +        if (src_anon_vma) {
> > +                anon_vma_unlock_write(src_anon_vma);
> > +                put_anon_vma(src_anon_vma);
> > +        }
> > +        if (src_folio) {
> > +                folio_unlock(src_folio);
> > +                folio_put(src_folio);
> > +        }
> > +        if (dst_pte)
> > +                pte_unmap(dst_pte);
> > +        if (src_pte)
> > +                pte_unmap(src_pte);
> > +        mmu_notifier_invalidate_range_end(&range);
> > +
> > +        return err;
> > +}
> [...]
> > +ssize_t remap_pages(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> > +                    unsigned long dst_start, unsigned long src_start,
> > +                    unsigned long len, __u64 mode)
> > +{
> > +        struct vm_area_struct *src_vma, *dst_vma;
> > +        unsigned long src_addr, dst_addr;
> > +        pmd_t *src_pmd, *dst_pmd;
> > +        long err = -EINVAL;
> > +        ssize_t moved = 0;
> > +
> > +        /*
> > +         * Sanitize the command parameters:
> > +         */
> > +        BUG_ON(src_start & ~PAGE_MASK);
> > +        BUG_ON(dst_start & ~PAGE_MASK);
> > +        BUG_ON(len & ~PAGE_MASK);
> > +
> > +        /* Does the address range wrap, or is the span zero-sized? */
> > +        BUG_ON(src_start + len <= src_start);
> > +        BUG_ON(dst_start + len <= dst_start);
> > +
> > +        /*
> > +         * Because these are read sempahores there's no risk of lock
> > +         * inversion.
> > +         */
> > +        mmap_read_lock(dst_mm);
> > +        if (dst_mm != src_mm)
> > +                mmap_read_lock(src_mm);
> > +
> > +        /*
> > +         * Make sure the vma is not shared, that the src and dst remap
> > +         * ranges are both valid and fully within a single existing
> > +         * vma.
> > +         */
> > +        src_vma = find_vma(src_mm, src_start);
> > +        if (!src_vma || (src_vma->vm_flags & VM_SHARED))
> > +                goto out;
> > +        if (src_start < src_vma->vm_start ||
> > +            src_start + len > src_vma->vm_end)
> > +                goto out;
> > +
> > +        dst_vma = find_vma(dst_mm, dst_start);
> > +        if (!dst_vma || (dst_vma->vm_flags & VM_SHARED))
> > +                goto out;
> > +        if (dst_start < dst_vma->vm_start ||
> > +            dst_start + len > dst_vma->vm_end)
> > +                goto out;
> > +
> > +        err = validate_remap_areas(src_vma, dst_vma);
> > +        if (err)
> > +                goto out;
> > +
> > +        for (src_addr = src_start, dst_addr = dst_start;
> > +             src_addr < src_start + len;) {
> > +                spinlock_t *ptl;
> > +                pmd_t dst_pmdval;
> > +                unsigned long step_size;
> > +
> > +                BUG_ON(dst_addr >= dst_start + len);
> > +                /*
> > +                 * Below works because anonymous area would not have a
> > +                 * transparent huge PUD.
> > +                 * If file-backed support is added,
> > +                 * that case would need to be handled here.
> > +                 */
> > +                src_pmd = mm_find_pmd(src_mm, src_addr);
> > +                if (unlikely(!src_pmd)) {
> > +                        if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
> > +                                err = -ENOENT;
> > +                                break;
> > +                        }
> > +                        src_pmd = mm_alloc_pmd(src_mm, src_addr);
> > +                        if (unlikely(!src_pmd)) {
> > +                                err = -ENOMEM;
> > +                                break;
> > +                        }
> > +                }
> > +                dst_pmd = mm_alloc_pmd(dst_mm, dst_addr);
> > +                if (unlikely(!dst_pmd)) {
> > +                        err = -ENOMEM;
> > +                        break;
> > +                }
> > +
> > +                dst_pmdval = pmdp_get_lockless(dst_pmd);
> > +                /*
> > +                 * If the dst_pmd is mapped as THP don't override it and just
> > +                 * be strict. If dst_pmd changes into TPH after this check, the
> > +                 * remap_pages_huge_pmd() will detect the change and retry
> > +                 * while remap_pages_pte() will detect the change and fail.
> > +                 */
> > +                if (unlikely(pmd_trans_huge(dst_pmdval))) {
> > +                        err = -EEXIST;
> > +                        break;
> > +                }
> > +
> > +                ptl = pmd_trans_huge_lock(src_pmd, src_vma);
> > +                if (ptl && !pmd_trans_huge(*src_pmd)) {
> > +                        spin_unlock(ptl);
> > +                        ptl = NULL;
> > +                }
>
> This still looks wrong - we do still have to split_huge_pmd()
> somewhere so that remap_pages_pte() works.

Hmm, I guess this extra check is not even needed...

>
> > +                if (ptl) {
> > +                        /*
> > +                         * Check if we can move the pmd without
> > +                         * splitting it. First check the address
> > +                         * alignment to be the same in src/dst. These
> > +                         * checks don't actually need the PT lock but
> > +                         * it's good to do it here to optimize this
> > +                         * block away at build time if
> > +                         * CONFIG_TRANSPARENT_HUGEPAGE is not set.
> > +                         */
> > +                        if ((src_addr & ~HPAGE_PMD_MASK) || (dst_addr & ~HPAGE_PMD_MASK) ||
> > +                            src_start + len - src_addr < HPAGE_PMD_SIZE || !pmd_none(dst_pmdval)) {
> > +                                spin_unlock(ptl);
> > +                                split_huge_pmd(src_vma, src_pmd, src_addr);
> > +                                continue;
> > +                        }
> > +
> > +                        err = remap_pages_huge_pmd(dst_mm, src_mm,
> > +                                                   dst_pmd, src_pmd,
> > +                                                   dst_pmdval,
> > +                                                   dst_vma, src_vma,
> > +                                                   dst_addr, src_addr);
> > +                        step_size = HPAGE_PMD_SIZE;
> > +                } else {
> > +                        if (pmd_none(*src_pmd)) {
> > +                                if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) {
> > +                                        err = -ENOENT;
> > +                                        break;
> > +                                }
> > +                                if (unlikely(__pte_alloc(src_mm, src_pmd))) {
> > +                                        err = -ENOMEM;
> > +                                        break;
> > +                                }
> > +                        }
> > +
> > +                        if (unlikely(pte_alloc(dst_mm, dst_pmd))) {
> > +                                err = -ENOMEM;
> > +                                break;
> > +                        }
> > +
> > +                        err = remap_pages_pte(dst_mm, src_mm,
> > +                                              dst_pmd, src_pmd,
> > +                                              dst_vma, src_vma,
> > +                                              dst_addr, src_addr,
> > +                                              mode);
> > +                        step_size = PAGE_SIZE;
> > +                }
> > +
> > +                cond_resched();
> > +
> > +                if (!err) {
> > +                        dst_addr += step_size;
> > +                        src_addr += step_size;
> > +                        moved += step_size;
> > +                }
> > +
> > +                if ((!err || err == -EAGAIN) &&
> > +                    fatal_signal_pending(current))
> > +                        err = -EINTR;
> > +
> > +                if (err && err != -EAGAIN)
> > +                        break;
> > +        }
> > +
> > +out:
> > +        mmap_read_unlock(dst_mm);
> > +        if (dst_mm != src_mm)
> > +                mmap_read_unlock(src_mm);
> > +        BUG_ON(moved < 0);
> > +        BUG_ON(err > 0);
> > +        BUG_ON(!moved && !err);
> > +        return moved ? moved : err;
> > +}
> > --
> > 2.42.0.515.g380fc7ccd1-goog
> >