From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 57669E82CB2 for ; Wed, 27 Sep 2023 17:12:44 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id CB54A8D009A; Wed, 27 Sep 2023 13:12:43 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id C3F3F8D0093; Wed, 27 Sep 2023 13:12:43 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id A91E48D009A; Wed, 27 Sep 2023 13:12:43 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 912C78D0093 for ; Wed, 27 Sep 2023 13:12:43 -0400 (EDT) Received: from smtpin15.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay10.hostedemail.com (Postfix) with ESMTP id 5A851C0DCF for ; Wed, 27 Sep 2023 17:12:43 +0000 (UTC) X-FDA: 81283021806.15.BC73C0B Received: from mail-yw1-f175.google.com (mail-yw1-f175.google.com [209.85.128.175]) by imf04.hostedemail.com (Postfix) with ESMTP id 7F4A54001D for ; Wed, 27 Sep 2023 17:12:41 +0000 (UTC) Authentication-Results: imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=K6AiBmNE; spf=pass (imf04.hostedemail.com: domain of surenb@google.com designates 209.85.128.175 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1695834761; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=LvUAki+h0o6OBk2tbEzSZshv+nhr9s6eBcOiPnVYGzk=; b=jmq0kNZrPinoAooIJwCT3yeVHoKCac4uYbVU8EsqZIpd5KPkI2CdZotBoxJryRotuUrEWt J426VtMiJ5iKIPCUBTmTNudY16m0r2UXuosFR3pJfANcPbx4TTPJYT1SYggSdq+wkqJ7ft vjKfFRE6Gw707Dd1tAZ8LmTXXm4jhtA= ARC-Authentication-Results: i=1; imf04.hostedemail.com; dkim=pass header.d=google.com header.s=20230601 header.b=K6AiBmNE; spf=pass (imf04.hostedemail.com: domain of surenb@google.com designates 209.85.128.175 as permitted sender) smtp.mailfrom=surenb@google.com; dmarc=pass (policy=reject) header.from=google.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1695834761; a=rsa-sha256; cv=none; b=SLHMbnee5DBwCX0k3YmIBo8q8Evtb+iBXhEEcWT3/MnzOYC6u0RyA+S7js7X7LMSgalOvH 2kDFKUp1eRQnoYLaSGpUZa8Dsu3k0xdGc0SuWAjimN64OaL8asEyAduv9iRzQ5+WJy8ujB uXMkc8P/Yk7WNylwcsWkTRKndBXTSvQ= Received: by mail-yw1-f175.google.com with SMTP id 00721157ae682-59f6e6b206fso93024307b3.3 for ; Wed, 27 Sep 2023 10:12:41 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20230601; t=1695834760; x=1696439560; darn=kvack.org; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=LvUAki+h0o6OBk2tbEzSZshv+nhr9s6eBcOiPnVYGzk=; b=K6AiBmNEo+vMjztQbExSaZMRo3vLS8XwrjMO+C1vSh58oYfp3jY65iTVkGCEeyRMsu AVUc6cwxXTaVsbTHVCvxJEZEzcH8JmlymJVFHeexdgbGuug4enGFm1VgSdUndov5MTqG 5yil9BrcjsWMSyZ5/CIhyNvYXi0hbm6b5zot6sdUSeY3KxJ+D87dV6CSGW3m5277HalO /La91/FkeEXYaaFzcVRuZ9Wf17m9+95UoqEH6/KpPQvKdHOFK51oI7+nAPvpicn+nrEL oGbrMAQuT8QQ2XRkJB3lfDkb+38srtoX3RxUqzM8RAGZix8vAbbF+43t2UwYvSbuNv65 RcLw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1695834760; x=1696439560; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=LvUAki+h0o6OBk2tbEzSZshv+nhr9s6eBcOiPnVYGzk=; b=vAq2xloA1T5SP4mcsS0IUJUFLYsN+LK4/BbPcv2IRqeHS9n//EGesuCSWwnh03Gs7d FgosC+VS62xmjibbVrxCKWU6CxNYcVKi5+SepC9iQ0VSSOqFCBl7LT5hFKiiYxc4j2fW NlukdJHjJzXi5w519Iiv8QQMiHyQU4zEqQWfaXk3jESX5pAP8a0tUzEDADjcFBw7hNQ3 OQNCSgVkIqeAFFvMNtJuDPnO/cx8L5ASvvb10jdIHqUGPZELKf8/g+VnU5nBn9y+uosX dxmqX9mkiJtJS5nIMj0ptVJ2NFJM3MoeEdjS/Zc90MejDXJX8B1/PZkrnEKtCTqcVPfv Dc1g== X-Gm-Message-State: AOJu0YyaCTA++VE9oinjTdD/Gi0yOE/twqT9sygrG/20Fnv1F9hDeErX kYoIFwtoz4eCsd/FOFeLEp5m6C3Qas9leD7CZf98wA== X-Google-Smtp-Source: AGHT+IGusg+pz3SyKBvFMux72qCFiV8POf7w4/mSCv+hhQ1tzbJGLxBY6WVWDQ1aP4eo9wuYiU9oJLvakvRs8U6BLB8= X-Received: by 2002:a25:cacd:0:b0:d81:c10d:7e1a with SMTP id a196-20020a25cacd000000b00d81c10d7e1amr2776442ybg.58.1695834760324; Wed, 27 Sep 2023 10:12:40 -0700 (PDT) MIME-Version: 1.0 References: <20230923013148.1390521-1-surenb@google.com> <20230923013148.1390521-3-surenb@google.com> In-Reply-To: From: Suren Baghdasaryan Date: Wed, 27 Sep 2023 10:12:28 -0700 Message-ID: Subject: Re: potential new userfaultfd vs khugepaged conflict [was: Re: [PATCH v2 2/3] userfaultfd: UFFDIO_REMAP uABI] To: Jann Horn Cc: Hugh Dickins , Andrew Morton , Al Viro , brauner@kernel.org, Shuah Khan , Andrea Arcangeli , Lokesh Gidra , Peter Xu , David Hildenbrand , Michal Hocko , Axel Rasmussen , Mike Rapoport , willy@infradead.org, Liam.Howlett@oracle.com, zhangpeng362@huawei.com, Brian Geffon , Kalesh Singh , Nicolas Geoffray , Jared Duke , Linux-MM , linux-fsdevel , kernel list , "open list:KERNEL SELFTEST FRAMEWORK" , kernel-team Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Rspamd-Queue-Id: 7F4A54001D X-Rspam-User: X-Stat-Signature: yxbqthfhm6f3hpwdep6pqg9rbes57seb X-Rspamd-Server: rspam01 X-HE-Tag: 1695834761-119522 X-HE-Meta: U2FsdGVkX1/BYgY7Yz0EPTpcDgAuzGAEa/OwUK7SJ0qyDHtpbNxiknBoGwpR3wXVzwah8qTRITU3jqUOHWr1tBrvzTCk63sIOy7L7RArcM+Qwqj0xjH4uWgLJLqOI6jdsxFiUZz8nKwZNjYlR6Pdn7JryBSYrUeoMKcH553t/A3eM7w/3XhXCCpbYwVS9y+wVjvpiZ2jyR17f+IW4xJYDDoQHfPSX+jFgjxKTryFjwcA0L22dgZ43LatBTdoLUbpiOXINzQAjVE83bcYBRJe3IumWAuyMa5IVYWGZUqniSfLqSE2KnhEmNXT52w/+7BNveklwTfg32i8zdAYeKGs9ZjV5rmkm2z4msBGHOlKI3rioXYecRABA13H68NxnbjBSQs44IhBXVBy9E+PfIGbD9xgOeYpkFpvU82PqQgxs9cs7ol3KjpYVFs8nVlUUphMx+Lx4jUMpnR8saULM+uPi5K5XS1pcELjnaA8ZSczjBIyiQJZNutl2namDOV3i4l3owVg1q5zjGciJydY2CuQOJ9cNDpQzNXOKxEgve87inzkNV2lDzPtgmQKKXRI7y5jKzpFCTuxS/CiBa6LhuDX4aB66SMh+OqrW5dy4xSybQPmL+kD4J4NRpmK3hyeufV/64Y6f7uSnT16Kln77+ueOk3xVk+uARpYFEQ7qTvOJnXGAgzqwoXQk21ScwNiBXA9YJg53q+oIQtBjWX9+RPboIZodnHAQWx+YRgB6HDUAzJYfl1GlgVADCZ+386WXWF4KFx8SLnyzOfpxnBHx4rc1norc/nBR6cnI6jkT4Twbr+QULfNbrOigaHrPu+gWlngrrTItXR8vmcb5kFFxyNtLxkkP8AyoLmQAM9aTN4rCxcrLgYrHFPSiO0rZzvGCyoBB473c/4k4ruOjRMWml/c6p7vom3p9XhgApUaeBpxoel8dyl0oIe+pScd9FNrgkk6lDuEXCJ/u9wYc+BGU19 kQ/nzIbg c4qbzbfyZDgG/oG3EpLxSmVgEkHaHiFQfAKB1+ttknuVCPrFNSWcHqrzMLHl1aoin2BwKr5iAVHZnK+kyQrlr3/Vx8MuxfUQClHOtXOisfNc4E4lh8BgaIy0FtBdXn5L8+D5HtU7zoLnJESx47oQJKEWDdxpGp5aFWRSW3j1sZFk4a/aciiR1z3PJWdp3xW6fFIl3FgVSDLmSA8ozShiUtHtYSA67zI5+OHF8PIzOwWNzkOthQ2N/wEBlQ4L/u6+/Q5GgGLOTsaChgDdPxmtSBAaTI5Du+AfWKKgDgudtBds3KdHRB9ydb9flTN89l4Ft6jYAbPCLpKXtuU4= X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Sep 27, 2023 at 3:07=E2=80=AFAM Jann Horn wrote: > > [moving Hugh into "To:" recipients as FYI for khugepaged interaction] > > On Sat, Sep 23, 2023 at 3:31=E2=80=AFAM Suren Baghdasaryan wrote: > > From: Andrea Arcangeli > > > > This implements the uABI of UFFDIO_REMAP. > > > > Notably one mode bitflag is also forwarded (and in turn known) by the > > lowlevel remap_pages method. > > > > Signed-off-by: Andrea Arcangeli > > Signed-off-by: Suren Baghdasaryan > [...] > > +/* > > + * The mmap_lock for reading is held by the caller. Just move the page > > + * from src_pmd to dst_pmd if possible, and return true if succeeded > > + * in moving the page. > > + */ > > +static int remap_pages_pte(struct mm_struct *dst_mm, > > + struct mm_struct *src_mm, > > + pmd_t *dst_pmd, > > + pmd_t *src_pmd, > > + struct vm_area_struct *dst_vma, > > + struct vm_area_struct *src_vma, > > + unsigned long dst_addr, > > + unsigned long src_addr, > > + __u64 mode) > > +{ > > + swp_entry_t entry; > > + pte_t orig_src_pte, orig_dst_pte; > > + spinlock_t *src_ptl, *dst_ptl; > > + pte_t *src_pte =3D NULL; > > + pte_t *dst_pte =3D NULL; > > + > > + struct folio *src_folio =3D NULL; > > + struct anon_vma *src_anon_vma =3D NULL; > > + struct mmu_notifier_range range; > > + int err =3D 0; > > + > > + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, src_mm, > > + src_addr, src_addr + PAGE_SIZE); > > + mmu_notifier_invalidate_range_start(&range); > > +retry: > > + dst_pte =3D pte_offset_map_nolock(dst_mm, dst_pmd, dst_addr, &d= st_ptl); > > + > > + /* If an huge pmd materialized from under us fail */ > > + if (unlikely(!dst_pte)) { > > + err =3D -EFAULT; > > + goto out; > > + } > > + > > + src_pte =3D pte_offset_map_nolock(src_mm, src_pmd, src_addr, &s= rc_ptl); > > + > > + /* > > + * We held the mmap_lock for reading so MADV_DONTNEED > > + * can zap transparent huge pages under us, or the > > + * transparent huge page fault can establish new > > + * transparent huge pages under us. > > + */ > > + if (unlikely(!src_pte)) { > > + err =3D -EFAULT; > > + goto out; > > + } > > + > > + BUG_ON(pmd_none(*dst_pmd)); > > + BUG_ON(pmd_none(*src_pmd)); > > + BUG_ON(pmd_trans_huge(*dst_pmd)); > > + BUG_ON(pmd_trans_huge(*src_pmd)); > > This works for now, but note that Hugh Dickins has recently been > reworking khugepaged such that PTE-based mappings can be collapsed > into transhuge mappings under the mmap lock held in *read mode*; > holders of the mmap lock in read mode can only synchronize against > this by taking the right page table spinlock and rechecking the pmd > value. This is only the case for file-based mappings so far, not for > anonymous private VMAs; and this code only operates on anonymous > private VMAs so far, so it works out. > > But if either Hugh further reworks khugepaged such that anonymous VMAs > can be collapsed under the mmap lock in read mode, or you expand this > code to work on file-backed VMAs, then it will become possible to hit > these BUG_ON() calls. I'm not sure what the plans for khugepaged going > forward are, but the number of edgecases everyone has to keep in mind > would go down if you changed this function to deal gracefully with > page tables disappearing under you. > > In the newest version of mm/pgtable-generic.c, above > __pte_offset_map_lock(), there is a big comment block explaining the > current rules for page table access; in particular, regarding the > helper pte_offset_map_nolock() that you're using: > > * pte_offset_map_nolock(mm, pmd, addr, ptlp), above, is like pte_offset_= map(); > * but when successful, it also outputs a pointer to the spinlock in ptlp= - as > * pte_offset_map_lock() does, but in this case without locking it. This= helps > * the caller to avoid a later pte_lockptr(mm, *pmd), which might by that= time > * act on a changed *pmd: pte_offset_map_nolock() provides the correct sp= inlock > * pointer for the page table that it returns. In principle, the caller = should > * recheck *pmd once the lock is taken; in practice, no callsite needs th= at - > * either the mmap_lock for write, or pte_same() check on contents, is en= ough. > > If this becomes hittable in the future, I think you will need to > recheck *pmd, at least for dst_pte, to avoid copying PTEs into a > detached page table. Thanks for the warning, Jann. It sounds to me it would be better to add this pmd check now even though it's not hittable. Does that sound good to everyone? > > > + spin_lock(dst_ptl); > > + orig_dst_pte =3D *dst_pte; > > + spin_unlock(dst_ptl); > > + if (!pte_none(orig_dst_pte)) { > > + err =3D -EEXIST; > > + goto out; > > + } > > + > > + spin_lock(src_ptl); > > + orig_src_pte =3D *src_pte; > > + spin_unlock(src_ptl); > > + if (pte_none(orig_src_pte)) { > > + if (!(mode & UFFDIO_REMAP_MODE_ALLOW_SRC_HOLES)) > > + err =3D -ENOENT; > > + else /* nothing to do to remap a hole */ > > + err =3D 0; > > + goto out; > > + } > > + > > + if (pte_present(orig_src_pte)) { > > + /* > > + * Pin and lock both source folio and anon_vma. Since w= e are in > > + * RCU read section, we can't block, so on contention h= ave to > > + * unmap the ptes, obtain the lock and retry. > > + */ > > + if (!src_folio) { > > + struct folio *folio; > > + > > + /* > > + * Pin the page while holding the lock to be su= re the > > + * page isn't freed under us > > + */ > > + spin_lock(src_ptl); > > + if (!pte_same(orig_src_pte, *src_pte)) { > > + spin_unlock(src_ptl); > > + err =3D -EAGAIN; > > + goto out; > > + } > > + > > + folio =3D vm_normal_folio(src_vma, src_addr, or= ig_src_pte); > > + if (!folio || !folio_test_anon(folio) || > > + folio_test_large(folio) || > > + folio_estimated_sharers(folio) !=3D 1) { > > + spin_unlock(src_ptl); > > + err =3D -EBUSY; > > + goto out; > > + } > > + > > + folio_get(folio); > > + src_folio =3D folio; > > + spin_unlock(src_ptl); > > + > > + /* block all concurrent rmap walks */ > > + if (!folio_trylock(src_folio)) { > > + pte_unmap(&orig_src_pte); > > + pte_unmap(&orig_dst_pte); > > + src_pte =3D dst_pte =3D NULL; > > + /* now we can block and wait */ > > + folio_lock(src_folio); > > + goto retry; > > + } > > + } > > + > > + if (!src_anon_vma) { > > + /* > > + * folio_referenced walks the anon_vma chain > > + * without the folio lock. Serialize against it= with > > + * the anon_vma lock, the folio lock is not eno= ugh. > > + */ > > + src_anon_vma =3D folio_get_anon_vma(src_folio); > > + if (!src_anon_vma) { > > + /* page was unmapped from under us */ > > + err =3D -EAGAIN; > > + goto out; > > + } > > + if (!anon_vma_trylock_write(src_anon_vma)) { > > + pte_unmap(&orig_src_pte); > > + pte_unmap(&orig_dst_pte); > > + src_pte =3D dst_pte =3D NULL; > > + /* now we can block and wait */ > > + anon_vma_lock_write(src_anon_vma); > > + goto retry; > > + } > > + } > > + > > + err =3D remap_anon_pte(dst_mm, src_mm, dst_vma, src_vm= a, > > + dst_addr, src_addr, dst_pte, src_p= te, > > + orig_dst_pte, orig_src_pte, > > + dst_ptl, src_ptl, src_folio); > > + } else { > > + entry =3D pte_to_swp_entry(orig_src_pte); > > + if (non_swap_entry(entry)) { > > + if (is_migration_entry(entry)) { > > + pte_unmap(&orig_src_pte); > > + pte_unmap(&orig_dst_pte); > > + src_pte =3D dst_pte =3D NULL; > > + migration_entry_wait(src_mm, src_pmd, > > + src_addr); > > + err =3D -EAGAIN; > > + } else > > + err =3D -EFAULT; > > + goto out; > > + } > > + > > + err =3D remap_swap_pte(dst_mm, src_mm, dst_addr, src_ad= dr, > > + dst_pte, src_pte, > > + orig_dst_pte, orig_src_pte, > > + dst_ptl, src_ptl); > > + } > > + > > +out: > > + if (src_anon_vma) { > > + anon_vma_unlock_write(src_anon_vma); > > + put_anon_vma(src_anon_vma); > > + } > > + if (src_folio) { > > + folio_unlock(src_folio); > > + folio_put(src_folio); > > + } > > + if (dst_pte) > > + pte_unmap(dst_pte); > > + if (src_pte) > > + pte_unmap(src_pte); > > + mmu_notifier_invalidate_range_end(&range); > > + > > + return err; > > +}