Date: Tue, 22 Aug 2023 10:31:25 -0400
From: Peter Xu <peterx@redhat.com>
To: Hugh Dickins
Cc: Jann Horn, Andrew Morton, Mike Kravetz, Mike Rapoport,
	"Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand,
	Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Zijlstra,
	Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny,
	Steven Price, SeongJae Park, Lorenzo Stoakes, Huang Ying,
	Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe,
	Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin,
	Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom,
	Russell King, "David S. Miller", Michael Ellerman,
	"Aneesh Kumar K.V", Heiko Carstens, Christian Borntraeger,
	Claudio Imbrenda, Alexander Gordeev, Gerald Schaefer,
	Vasily Gorbik, Vishal Moola, Vlastimil Babka, Zi Yan, Zach O'Keefe,
	Linux ARM, sparclinux@vger.kernel.org, linuxppc-dev, linux-s390,
	kernel list, Linux-MM
Subject: Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
References: <4d31abf5-56c0-9f3d-d12f-c9317936691@google.com>

Hi, Hugh, Jann,

On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote:
> On Mon, 21 Aug 2023, Jann Horn wrote:
> > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins wrote:
> > > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> > > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> > > thought it had emptied: page lock on the huge page is enough to protect
> > > against WP faults (which find the PTE has been cleared), but not enough
> > > to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
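To spell out the race being described (an illustrative interleaving,
pieced together from the description above; not code quoted from either
patch):

	/*
	 * CPU 0: collapse_pte_mapped_thp()    CPU 1: UFFDIO_COPY ioctl
	 * --------------------------------    ------------------------
	 * step 2: clear PTEs under ptl
	 * pte_unmap_unlock(start_pte, ptl)
	 *                                     mfill_atomic_pte() installs
	 *                                     a fresh PTE in the "emptied"
	 *                                     page table (the huge page
	 *                                     lock does not prevent this)
	 * step 4: pml = pmd_lock(mm, pmd)
	 *         pmdp_collapse_flush(...)
	 *         collapses a page table that
	 *         still maps a page ->
	 *         "BUG: Bad rss-counter state"
	 */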
Miller" , Michael Ellerman , "Aneesh Kumar K.V" , Heiko Carstens , Christian Borntraeger , Claudio Imbrenda , Alexander Gordeev , Gerald Schaefer , Vasily Gorbik , Vishal Moola , Vlastimil Babka , Zi Yan , Zach O'Keefe , Linux ARM , sparclinux@vger.kernel.org, linuxppc-dev , linux-s390 , kernel list , Linux-MM Subject: Re: [PATCH mm-unstable] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd Message-ID: References: <4d31abf5-56c0-9f3d-d12f-c9317936691@google.com> MIME-Version: 1.0 In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit X-Rspam-User: X-Rspamd-Server: rspam12 X-Rspamd-Queue-Id: EE8B9140048 X-Stat-Signature: xfy6pbty49m3qf5co6woojeanxqiwtfw X-HE-Tag: 1692714694-433886 X-HE-Meta: U2FsdGVkX1+K8FE59971lTKXsV34sn78+5J+pVtgjON81uL9BJ7nixRCjWFI/YkI7cH7SYbzGKHIMBTKgrwUr26qckU5Z8heh6IglwPtbLdlUpw0SWVGD8zpjQHBrYyrgwDngb4CQO2FAufjS21haeFrsE3YWOvZnlzrod026YH/Q4tMoYHDUjtoJ7OA5GK1o1oxvHH1L7yRP4yVe766eKf5mtlN49THfJsux1gDS7EHI6Sa695zoegtMnDa+qB9a235M3ENnKBScB5Q5VXMJAr4aa6k9fEFMsn+aTXCo3ftoSTnVNIhodeuAqZHl9ADVLwT5Z8JPh42JkdvJeijsBGVz6Tm4OuUkFVq7Kw8FlsckBsP62Il9sL6McvjgzEwiyR3iHN+xyC/qcAbOVLakSeWRJ2ADQ6o5htuCyq+8V5Uj6ms1sjqcjIqGzGEf/S8LPmKkodSxq8rbXRW7s7bdMc7CWdA/wkLYNfQ0U4fhvngf27Du7oA1NOkm+z4R/9VwUIOe/aR3P+L+AQW11W3uZfwg/2l8zG7YcAUtpaz6dPGYEy14wb+RDgj3R3zEGuW4TgjGaO7tgtdLguVIVO+QTASGn8YmEuHSmvSouUh71ZmuRTW5ZBli4iltAKWCtoq3hEgJ/Nalv90REjVknGgmGC8YMb5OBtk70iydRo8ovOY8jYXuW4otrolr0fW5GedetE0Il/n6diP4vm4wu0hijxkQn4d03qSKyvF7WEThlG5J7LIdXOHL9d3zPCge75MoYCPNk5+zSnO8cPDzroaNuUNdIV2alfEPU+4h+erk4H8xWTzCVJ+0/HA4NOaNbieQas3nM5b5m6rCoc/cSGRGEkGf4IGf/DSIBBSAUbnrl2HPcnTic7FkvDeui61K83LY6GYXHKI0nwhSi74j2R8EQuxVmvP+ffGxkngFrPbznCkUZC2cC+kqUTjTfYZH0E+sv14PSyu78smdVcGRsy IW3i/fZO fjQ7ZNnan9n2tDC82dESwgaeWMOGM+4hETtgKt7VW0CgUk7x5KAE4KP2GNgyyqnOETA2FsYKmajiP90nhUBfX/TeX0gyGcZVefPcHgJCyv6TYB1uSlEIwkJ6bDD3I7qkoFOJUFVOcbJ2x8iqj2nSOUPSUKf+Vy4JSCh+YDTSFRbWrMnz0l5DMycloRbEV9Oqiy77anYH8UINnU/ZDPCiNAtNpwXVFcoGqTRvRztl0KeFK/H6DmZVm3cS66diITArEM1j9gQFEpqxSJz90dOunL0Pvkd+lyYfJ9tzrakdUii7V7dwWwLVAxi9d+Zp1jppMawdq9SizWQUWhQX1A1IdpJQISdDY7cGDQ9Nh68OUG84k3TX+vLgnFIFBtwXzZFbLtYr+yeo+SNo9ZSXD+0gvAUHzKNjy6Ry3bvJXPNnEHqMG9IdKHCdsxXNEyb6rM45n9xDBdqmGBok+mDv/A0SdyRySqxypxngQuUEJIpq34HcU4go6c1ox6BofuS4KZzZD5XaKe2mHAQwypQMATK2kimNZr7qU3pxrzXfut8eT0HURbtilNz8FmtXZp3JYJK6Jw861AscDPoz1TlLvyyIq2gt3Nw== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Hi, Hugh, Jann, On Mon, Aug 21, 2023 at 07:51:38PM -0700, Hugh Dickins wrote: > On Mon, 21 Aug 2023, Jann Horn wrote: > > On Mon, Aug 21, 2023 at 9:51 PM Hugh Dickins wrote: > > > Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private > > > shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp() > > > thought it had emptied: page lock on the huge page is enough to protect > > > against WP faults (which find the PTE has been cleared), but not enough > > > to protect against userfaultfd. "BUG: Bad rss-counter state" followed. > > > > > > retract_page_tables() protects against this by checking !vma->anon_vma; > > > but we know that MADV_COLLAPSE needs to be able to work on private shmem > > > mappings, even those with an anon_vma prepared for another part of the > > > mapping; and we know that MADV_COLLAPSE needs to work on shared shmem > > > mappings which are userfaultfd_armed(). 
> > Special-casing userfaultfd like this makes me a bit uncomfortable; but
> > I also can't find anything other than userfaultfd that would insert
> > pages into regions that are khugepaged-compatible, so I guess this
> > works?
>
> I'm as sure as I can be that it's solely because userfaultfd breaks
> the usual rules here (and in fairness, IIRC Andrea did ask my permission
> before making it behave that way on shmem, COWing without a source page).
>
> Perhaps something else will want that same behaviour in future (it's
> tempting, but difficult to guarantee correctness); for now, it is just
> userfaultfd (but by saying "_armed" rather than "_missing", I'm half-
> expecting uffd to add more such exceptional modes in future).
>
> > I guess an alternative would be to use a spin_trylock() instead of the
> > current pmd_lock(), and if that fails, temporarily drop the page table
> > lock and then restart from step 2 with both locks held - and at that
> > point the page table scan should be fast since we expect it to usually
> > be empty.
>
> That's certainly a good idea, if collapse on userfaultfd_armed private
> is anything of a common case (I doubt, but I don't know).  It may be a
> better idea anyway (saving a drop and retake of ptlock).
>
> I gave it a try, expecting to end up with something that would lead
> me to say "I tried it, but it didn't work out well"; but actually it
> looks okay to me.  I wouldn't say I prefer it, but it seems reasonable,
> and no more complicated (as Peter rightly observes) than the original.
>
> It's up to you and Peter, and whoever has strong feelings about it,
> to choose between them: I don't mind (but I shall be sad if someone
> demands that I indent that comment deeper - I'm not a fan of long
> multi-line comments near column 80).

No strong opinion here, either.  Just one trivial comment/question below
on the new patch (if that will be preferred)..
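(The trylock-with-fallback shape that v2 adopts, abstracted with made-up
names outer/inner just to frame the diff below:)

	if (outer != inner && !spin_trylock(outer)) {
		/*
		 * Could not take outer while holding inner: drop inner,
		 * take both in the documented order, then recheck any
		 * state that was only stable while inner was held.
		 */
		spin_unlock(inner);
		spin_lock(outer);
		spin_lock_nested(inner, SINGLE_DEPTH_NESTING);
	}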
> [PATCH mm-unstable v2] mm/khugepaged: fix collapse_pte_mapped_thp() versus uffd
>
> Jann Horn demonstrated how userfaultfd ioctl UFFDIO_COPY into a private
> shmem mapping can add valid PTEs to page table collapse_pte_mapped_thp()
> thought it had emptied: page lock on the huge page is enough to protect
> against WP faults (which find the PTE has been cleared), but not enough
> to protect against userfaultfd.  "BUG: Bad rss-counter state" followed.
>
> retract_page_tables() protects against this by checking !vma->anon_vma;
> but we know that MADV_COLLAPSE needs to be able to work on private shmem
> mappings, even those with an anon_vma prepared for another part of the
> mapping; and we know that MADV_COLLAPSE needs to work on shared shmem
> mappings which are userfaultfd_armed().  Whether it needs to work on
> private shmem mappings which are userfaultfd_armed(), I'm not so sure:
> but assume that it does.
>
> Now trylock pmd lock without dropping ptlock (suggested by jannh): if
> that fails, drop and retake ptlock around taking pmd lock, and just in
> the uffd private case, go back to recheck and empty the page table.
>
> Reported-by: Jann Horn
> Closes: https://lore.kernel.org/linux-mm/CAG48ez0FxiRC4d3VTu_a9h=rg5FW-kYD5Rg5xo_RDBM0LTTqZQ@mail.gmail.com/
> Fixes: 1043173eb5eb ("mm/khugepaged: collapse_pte_mapped_thp() with mmap_read_lock()")
> Signed-off-by: Hugh Dickins
> ---
>  mm/khugepaged.c | 39 +++++++++++++++++++++++++++++----------
>  1 file changed, 29 insertions(+), 10 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 40d43eccdee8..ad1c571772fe 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1476,7 +1476,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	struct page *hpage;
>  	pte_t *start_pte, *pte;
>  	pmd_t *pmd, pgt_pmd;
> -	spinlock_t *pml, *ptl;
> +	spinlock_t *pml = NULL, *ptl;
>  	int nr_ptes = 0, result = SCAN_FAIL;
>  	int i;
>
> @@ -1572,9 +1572,10 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  				haddr, haddr + HPAGE_PMD_SIZE);
>  	mmu_notifier_invalidate_range_start(&range);
>  	notified = true;
> -	start_pte = pte_offset_map_lock(mm, pmd, haddr, &ptl);
> -	if (!start_pte)		/* mmap_lock + page lock should prevent this */
> -		goto abort;
> +	spin_lock(ptl);

.. here will the ptl always be valid?  That comes from the previous round
of pte_offset_map_lock(), and I assume after this whole "thp collapse
without write lock" work landed, it has the same lifecycle as the *pte
pointer, so it can become invalid right after the rcu read lock is
released; the mmap read lock isn't strong enough to protect the ptl, not
anymore.

Maybe it's all fine because the thp collapse path is the solo path(s)
that will release the pte pgtable page without the write mmap lock (so as
to release the ptl too when doing so), and we at least still hold the
page lock, so the worst case is that another concurrent "thp collapse"
will still serialize with this one on the huge page lock.  But that
doesn't look as solid as fetching the ptl again from another
pte_offset_map_nolock().  So I'm still just raising this question; it's
possible I just missed something.
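For concreteness, the refetch I mean would look something like this
(untested sketch, not part of the patch):

	/*
	 * Refetch ptl together with the pte mapping, instead of reusing
	 * the pointer saved from the earlier pte_offset_map_lock(): the
	 * nolock variant revalidates the pmd and returns the current ptl.
	 */
	start_pte = pte_offset_map_nolock(mm, pmd, haddr, &ptl);
	if (!start_pte)		/* mmap_lock + page lock should prevent this */
		goto abort;
	spin_lock(ptl);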
> +recheck:
> +	start_pte = pte_offset_map(pmd, haddr);
> +	VM_BUG_ON(!start_pte);	/* mmap_lock + page lock should prevent this */
>
>  	/* step 2: clear page table and adjust rmap */
>  	for (i = 0, addr = haddr, pte = start_pte;
> @@ -1608,20 +1609,36 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  		nr_ptes++;
>  	}
>
> -	pte_unmap_unlock(start_pte, ptl);
> +	pte_unmap(start_pte);
>
>  	/* step 3: set proper refcount and mm_counters. */
>  	if (nr_ptes) {
>  		page_ref_sub(hpage, nr_ptes);
>  		add_mm_counter(mm, mm_counter_file(hpage), -nr_ptes);
> +		nr_ptes = 0;
>  	}
>
> -	/* step 4: remove page table */
> +	/* step 4: remove empty page table */
> +	if (!pml) {
> +		pml = pmd_lockptr(mm, pmd);
> +		if (pml != ptl && !spin_trylock(pml)) {
> +			spin_unlock(ptl);
> +			spin_lock(pml);
> +			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +/*
> + * pmd_lock covers a wider range than ptl, and (if split from mm's
> + * page_table_lock) ptl nests inside pml.  The less time we hold pml,
> + * the better; but userfaultfd's mfill_atomic_pte() on a private VMA
> + * inserts a valid as-if-COWed PTE without even looking up page cache.
> + * So page lock of hpage does not protect from it, so we must not drop
> + * ptl before pgt_pmd is removed, so uffd private needs rechecking.
> + */
> +			if (userfaultfd_armed(vma) &&
> +			    !(vma->vm_flags & VM_SHARED))
> +				goto recheck;
> +		}
> +	}
>
> -	/* Huge page lock is still held, so page table must remain empty */
> -	pml = pmd_lock(mm, pmd);
> -	if (ptl != pml)
> -		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
>  	pgt_pmd = pmdp_collapse_flush(vma, haddr, pmd);
>  	pmdp_get_lockless_sync();
>  	if (ptl != pml)
> @@ -1648,6 +1665,8 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
>  	}
>  	if (start_pte)
>  		pte_unmap_unlock(start_pte, ptl);
> +	if (pml && pml != ptl)
> +		spin_unlock(pml);
>  	if (notified)
>  		mmu_notifier_invalidate_range_end(&range);
>  drop_hpage:
> --
> 2.35.3

--
Peter Xu