Date: Mon, 21 Aug 2023 19:51:38 -0700 (PDT)
From: Hugh Dickins
To: Jann Horn
Cc: Hugh Dickins, Andrew Morton, Mike Kravetz, Mike Rapoport,
    "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand,
    Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu,
    Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple,
    Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park,
    Lorenzo Stoakes, Huang Ying, Naoya Horiguchi, Christophe Leroy,
    Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual,
    Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig,
    Song Liu, Thomas Hellstrom, Russell King, "David S. Miller",
    Michael Ellerman, "Aneesh Kumar K.V", Heiko Carstens,
    Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev,
    Gerald Schaefer, Vasily Gorbik, Vishal Moola, Vlastimil Babka,
    Zi Yan, Zach O'Keefe, Linux ARM, sparclinux@vger.kernel.org,
    linuxppc-dev, linux-s390, kernel list, Linux-MM
Subject: Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
References: <4d31abf5-56c0-9f3d-d12f-c9317936691@google.com>

On Mon, 21 Aug 2023, Jann Horn wrote:
> On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins wrote:
> > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > thought it had emptied: page lock on the huge page is enough to protect
> > against WP faults (which find the PTE has been cleared), but not enough
> > to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
> >
> > retract_page_tables() protects against this by checking !vma->anon_vma;
> > but we know that MADV_COLLAPSE needs to be able to work on private shmem
> > mappings, even those with an anon_vma prepared for another part of the
> > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> > mappings which are userfaultfd_armed().
> > Whether it needs to work on
> > private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> > but assume that it does.
>
> I think we couldn't rely on anon_vma here anyway, since holding the
> mmap_lock in read mode doesn't prevent concurrent creation of an
> anon_vma?

We would have had to do the same as in retract_page_tables() (which
doesn't even have mmap_lock for read): recheck !vma->anon_vma after
finally acquiring ptlock.  But the !anon_vma limitation is certainly
not acceptable here anyway.

>
> > Just for this case, take the pmd_lock() two steps earlier: not because
> > it gives any protection against this case itself, but because ptlock
> > nests inside it, and it's the dropping of ptlock which let the bug in.
> > In other cases, continue to minimize the pmd_lock() hold time.
>
> Special-casing userfaultfd like this makes me a bit uncomfortable; but
> I also can't find anything other than userfaultfd that would insert
> pages into regions that are khugepaged-compatible, so I guess this
> works?

I'm as sure as I can be that it's solely because userfaultfd breaks
the usual rules here (and in fairness, IIRC Andrea did ask my permission
before making it behave that way on shmem, COWing without a source page).

Perhaps something else will want that same behaviour in future (it's
tempting, but difficult to guarantee correctness); for now, it is just
userfaultfd (but by saying "_armed" rather than "_missing", I'm
half-expecting uffd to add more such exceptional modes in future).

>
> I guess an alternative would be to use a spin_trylock() instead of the
> current pmd_lock(), and if that fails, temporarily drop the page table
> lock and then restart from step 2 with both locks held - and at that
> point the page table scan should be fast since we expect it to usually
> be empty.

That's certainly a good idea, if collapse on userfaultfd_armed private
is anything of a common case (I doubt, but I don't know).  It may be a
better idea anyway (saving a drop and retake of ptlock).

I gave it a try, expecting to end up with something that would lead
me to say "I tried it, but it didn't work out well"; but actually it
looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
and no more complicated (as Peter rightly observes) than the original.

It's up to you and Peter, and whoever has strong feelings about it,
to choose between them: I don't mind (but I shall be sad if someone
demands that I indent that comment deeper - I'm not a fan of long
multi-line comments near column 80).

[PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd

Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
thought it had emptied: page lock on the huge page is enough to protect
against WP faults (which find the PTE has been cleared), but not enough
to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.

retract_page_tables() protects against this by checking !vma->anon_vma;
but we know that MADV_COLLAPSE needs to be able to work on private shmem
mappings, even those with an anon_vma prepared for another part of the
mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
mappings which are userfaultfd_armed().  Whether it needs to work on
private shmem mappings which are userfaultfd_armed(), I'm not so sure:
but assume that it does.

Now trylock pmd lock without dropping ptlock (suggested by jannh): if
that fails, drop and retake ptlock around taking pmd lock, and just in
the uffd private case, go back to recheck and empty the page table.

Reported-by: Jann Horn
Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
Signed-off-by: Hugh Dickins
---
 mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
 1 file changed, 29 insertions(+), 10 deletions(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 40d43eccdee8..ad1c571772fe 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	struct page *hpage;
 	pte_t *start_pte, *pte;
 	pmd_t *pmd, pgt_pmd;
-	spinlock_t *pml, *ptl;
+	spinlock_t *pml = NULL, *ptl;
 	int nr_ptes = 0, result = SCAN_FAIL;
 	int i;
 
@@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 				haddr, haddr + HPAGE_PMD_SIZE);
 	mmu_notifier_invalidate_range_start(&range);
 	notified = true;
-	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
-	if (!start_pte)		/* mmap_lock + page lock should prevent this */
-		goto abort;
+	spin_lock(ptl);
+recheck:
+	start_pte = pte_offset_map(pmd, haddr);
+	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
 
 	/* step 2: clear page table and adjust rmap */
 	for (i = 0, addr = haddr, pte = start_pte;
@@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 		nr_ptes++;
 	}
 
-	pte_unmap_unlock(start_pte, ptl);
+	pte_unmap(start_pte);
 
 	/* step 3: set proper refcount and mm_counters. */
 	if (nr_ptes) {
 		page_ref_sub(hpage, nr_ptes);
 		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
+		nr_ptes = 0;
 	}
 
-	/* step 4: remove page table */
+	/* step 4: remove empty page table */
+	if (!pml) {
+		pml = pmd_lockptr(mm, pmd);
+		if (pml != ptl && !spin_trylock(pml)) {
+			spin_unlock(ptl);
+			spin_lock(pml);
+			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
+	/*
+	 * pmd_lock covers a wider range than ptl, and (if split from mm's
+	 * page_table_lock) ptl nests inside pml. The less time we hold pml,
+	 * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
+	 * inserts a valid as-if-COWed PTE without even looking up page cache.
+	 * So page lock of hpage does not protect from it, so we must not drop
+	 * ptl before pgt_pmd is removed, so uffd private needs rechecking.
+	 */
+			if (userfaultfd_armed(vma) &&
+			    !(vma->vm_flags & VM_SHARED))
+				goto recheck;
+		}
+	}
 
-	/* Huge page lock is still held, so page table must remain empty */
-	pml = pmd_lock(mm, pmd);
-	if (ptl != pml)
-		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
 	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
 	pmdp_get_lockless_sync();
 	if (ptl != pml)
@@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	}
 	if (start_pte)
 		pte_unmap_unlock(start_pte, ptl);
+	if (pml && pml != ptl)
+		spin_unlock(pml);
 	if (notified)
 		mmu_notifier_invalidate_range_end(&range);
 drop_hpage:
-- 
2.35.3
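
Not part of the patch: for readers who want the locking pattern in
isolation, here is a rough userspace sketch of "trylock the outer lock
while holding the inner one; on failure drop the inner lock, take both in
order, and recheck".  It uses pthread mutexes rather than kernel
spinlocks, and everything in it (struct table, clear_entries(),
empty_and_detach(), the racy flag) is a hypothetical stand-in invented
for illustration, not kernel API.  It also assumes the two locks are
always distinct, whereas in the patch pml may be the same lock as ptl.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_ENTRIES 8

/* Hypothetical userspace stand-in for a page table guarded by two locks. */
struct table {
	pthread_mutex_t outer;		/* plays the role of pml; ordered first */
	pthread_mutex_t inner;		/* plays the role of ptl; nests inside outer */
	int entries[NR_ENTRIES];	/* nonzero means "populated" */
	bool racy;	/* entries may be refilled whenever inner is dropped */
	bool detached;	/* set once the emptied table has been removed */
};

/* Clear every populated entry; caller holds t->inner.  Returns the count. */
int clear_entries(struct table *t)
{
	int cleared = 0;

	for (size_t i = 0; i < NR_ENTRIES; i++) {
		if (t->entries[i]) {
			t->entries[i] = 0;
			cleared++;
		}
	}
	return cleared;
}

/* Empty the table and detach it with both locks held.  Returns entries cleared. */
int empty_and_detach(struct table *t)
{
	bool have_outer = false;
	int cleared = 0;

	pthread_mutex_lock(&t->inner);
	for (;;) {
		cleared += clear_entries(t);
		if (have_outer)
			break;		/* second pass: both locks already held */
		have_outer = true;	/* every path below ends up holding outer */
		if (pthread_mutex_trylock(&t->outer) == 0)
			break;		/* fast path: inner was never dropped */
		/* Slow path: honour the outer-then-inner lock order. */
		pthread_mutex_unlock(&t->inner);
		pthread_mutex_lock(&t->outer);
		pthread_mutex_lock(&t->inner);
		if (!t->racy)
			break;		/* nothing could have refilled the table */
		/* inner was dropped: go round once more to recheck and re-empty */
	}
	assert(have_outer);
	t->detached = true;
	pthread_mutex_unlock(&t->inner);
	pthread_mutex_unlock(&t->outer);
	return cleared;
}
```

The point of trying the outer lock first is that the common case never
lets go of the inner lock, so no recheck is needed; only the contended
case pays for a second scan, which is expected to find the table empty.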