From: Jann Horn <jannh@google.com>
Date: Wed, 31 May 2023 17:34:58 +0200
Subject: Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock
References: <35e983f5-7ed3-b310-d949-9ae8b130cdab@google.com> <2e9996fa-d238-e7c-1194-834a2bd1f60@google.com>
In-Reply-To: <2e9996fa-d238-e7c-1194-834a2bd1f60@google.com>
To: Hugh Dickins
Cc: Andrew Morton, Mike Kravetz, Mike Rapoport, "Kirill A. Shutemov", Matthew Wilcox, David Hildenbrand, Suren Baghdasaryan, Qi Zheng, Yang Shi, Mel Gorman, Peter Xu, Peter Zijlstra, Will Deacon, Yu Zhao, Alistair Popple, Ralph Campbell, Ira Weiny, Steven Price, SeongJae Park, Naoya Horiguchi, Christophe Leroy, Zack Rusin, Jason Gunthorpe, Axel Rasmussen, Anshuman Khandual, Pasha Tatashin, Miaohe Lin, Minchan Kim, Christoph Hellwig, Song Liu, Thomas Hellstrom, Russell King, "David S. Miller", Michael Ellerman, "Aneesh Kumar K.V", Heiko Carstens, Christian Borntraeger, Claudio Imbrenda, Alexander Gordeev, linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org

On Mon, May 29, 2023 at 8:25 AM Hugh Dickins wrote:
> -static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff,
> -			       struct mm_struct *target_mm,
> -			       unsigned long target_addr, struct page *hpage,
> -			       struct collapse_control *cc)
> +static void retract_page_tables(struct address_space *mapping, pgoff_t pgoff)
>  {
>  	struct vm_area_struct *vma;
> -	int target_result = SCAN_FAIL;
>
> -	i_mmap_lock_write(mapping);
> +	i_mmap_lock_read(mapping);
>  	vma_interval_tree_foreach(vma, &mapping->i_mmap, pgoff, pgoff) {
> -		int result = SCAN_FAIL;
> -		struct mm_struct *mm = NULL;
> -		unsigned long addr = 0;
> -		pmd_t *pmd;
> -		bool is_target = false;
> +		struct mm_struct *mm;
> +		unsigned long addr;
> +		pmd_t *pmd, pgt_pmd;
> +		spinlock_t *pml;
> +		spinlock_t *ptl;
>
>  		/*
>  		 * Check vma->anon_vma to exclude MAP_PRIVATE mappings that
> -		 * got written to. These VMAs are likely not worth investing
> -		 * mmap_write_lock(mm) as PMD-mapping is likely to be split
> -		 * later.
> +		 * got written to. These VMAs are likely not worth removing
> +		 * page tables from, as PMD-mapping is likely to be split later.
>  		 *
> -		 * Note that vma->anon_vma check is racy: it can be set up after
> -		 * the check but before we took mmap_lock by the fault path.
> -		 * But page lock would prevent establishing any new ptes of the
> -		 * page, so we are safe.
> -		 *
> -		 * An alternative would be drop the check, but check that page
> -		 * table is clear before calling pmdp_collapse_flush() under
> -		 * ptl. It has higher chance to recover THP for the VMA, but
> -		 * has higher cost too. It would also probably require locking
> -		 * the anon_vma.
> +		 * Note that vma->anon_vma check is racy: it can be set after
> +		 * the check, but page locks (with XA_RETRY_ENTRYs in holes)
> +		 * prevented establishing new ptes of the page. So we are safe
> +		 * to remove page table below, without even checking it's empty.

This "we are safe to remove page table below, without even checking
it's empty" assumes that the only way to create new anonymous PTEs is
to use existing file PTEs, right?

What about private shmem VMAs that are registered with userfaultfd as
VM_UFFD_MISSING? I think for those, the UFFDIO_COPY ioctl lets you
directly insert anonymous PTEs without looking at the mapping and its
pages (except for checking that the insertion point is before
end-of-file), protected only by mmap_lock (shared) and
pte_offset_map_lock().

>  		 */
> -		if (READ_ONCE(vma->anon_vma)) {
> -			result = SCAN_PAGE_ANON;
> -			goto next;
> -		}
> +		if (READ_ONCE(vma->anon_vma))
> +			continue;
> +
>  		addr = vma->vm_start + ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
>  		if (addr & ~HPAGE_PMD_MASK ||
> -		    vma->vm_end < addr + HPAGE_PMD_SIZE) {
> -			result = SCAN_VMA_CHECK;
> -			goto next;
> -		}
> -		mm = vma->vm_mm;
> -		is_target = mm == target_mm && addr == target_addr;
> -		result = find_pmd_or_thp_or_none(mm, addr, &pmd);
> -		if (result != SCAN_SUCCEED)
> -			goto next;
> -		/*
> -		 * We need exclusive mmap_lock to retract page table.
> -		 *
> -		 * We use trylock due to lock inversion: we need to acquire
> -		 * mmap_lock while holding page lock. Fault path does it in
> -		 * reverse order. Trylock is a way to avoid deadlock.
> -		 *
> -		 * Also, it's not MADV_COLLAPSE's job to collapse other
> -		 * mappings - let khugepaged take care of them later.
> -		 */
> -		result = SCAN_PTE_MAPPED_HUGEPAGE;
> -		if ((cc->is_khugepaged || is_target) &&
> -		    mmap_write_trylock(mm)) {
> -			/* trylock for the same lock inversion as above */
> -			if (!vma_try_start_write(vma))
> -				goto unlock_next;
> -
> -			/*
> -			 * Re-check whether we have an ->anon_vma, because
> -			 * collapse_and_free_pmd() requires that either no
> -			 * ->anon_vma exists or the anon_vma is locked.
> -			 * We already checked ->anon_vma above, but that check
> -			 * is racy because ->anon_vma can be populated under the
> -			 * mmap lock in read mode.
> -			 */
> -			if (vma->anon_vma) {
> -				result = SCAN_PAGE_ANON;
> -				goto unlock_next;
> -			}
> -			/*
> -			 * When a vma is registered with uffd-wp, we can't
> -			 * recycle the pmd pgtable because there can be pte
> -			 * markers installed.  Skip it only, so the rest mm/vma
> -			 * can still have the same file mapped hugely, however
> -			 * it'll always mapped in small page size for uffd-wp
> -			 * registered ranges.
> -			 */
> -			if (hpage_collapse_test_exit(mm)) {
> -				result = SCAN_ANY_PROCESS;
> -				goto unlock_next;
> -			}
> -			if (userfaultfd_wp(vma)) {
> -				result = SCAN_PTE_UFFD_WP;
> -				goto unlock_next;
> -			}
> -			collapse_and_free_pmd(mm, vma, addr, pmd);

The old code called collapse_and_free_pmd(), which involves MMU
notifier invocation...

> -			if (!cc->is_khugepaged && is_target)
> -				result = set_huge_pmd(vma, addr, pmd, hpage);
> -			else
> -				result = SCAN_SUCCEED;
> -
> -unlock_next:
> -			mmap_write_unlock(mm);
> -			goto next;
> -		}
> -		/*
> -		 * Calling context will handle target mm/addr. Otherwise, let
> -		 * khugepaged try again later.
> -		 */
> -		if (!is_target) {
> -			khugepaged_add_pte_mapped_thp(mm, addr);
> +		    vma->vm_end < addr + HPAGE_PMD_SIZE)
>  			continue;
> -		}
> -next:
> -		if (is_target)
> -			target_result = result;
> +
> +		mm = vma->vm_mm;
> +		if (find_pmd_or_thp_or_none(mm, addr, &pmd) != SCAN_SUCCEED)
> +			continue;
> +
> +		if (hpage_collapse_test_exit(mm))
> +			continue;
> +		/*
> +		 * When a vma is registered with uffd-wp, we cannot recycle
> +		 * the page table because there may be pte markers installed.
> +		 * Other vmas can still have the same file mapped hugely, but
> +		 * skip this one: it will always be mapped in small page size
> +		 * for uffd-wp registered ranges.
> +		 *
> +		 * What if VM_UFFD_WP is set a moment after this check?  No
> +		 * problem, huge page lock is still held, stopping new mappings
> +		 * of page which might then get replaced by pte markers: only
> +		 * existing markers need to be protected here.  (We could check
> +		 * after getting ptl below, but this comment distracting there!)
> +		 */
> +		if (userfaultfd_wp(vma))
> +			continue;
> +
> +		/* Huge page lock is still held, so page table must be empty */
> +		pml = pmd_lock(mm, pmd);
> +		ptl = pte_lockptr(mm, pmd);
> +		if (ptl != pml)
> +			spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> +		pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);

... while the new code only does pmdp_collapse_flush(), which clears
the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
notifiers. My understanding is that that's problematic - maybe (?) it
is sort of okay with regards to classic MMU notifier users like KVM,
but it's probably wrong for IOMMUv2 users, where an IOMMU directly
consumes the normal page tables?

(FWIW, last I looked, there also seemed to be some other issues with
MMU notifier usage wrt IOMMUv2, see the thread .)

> +		if (ptl != pml)
> +			spin_unlock(ptl);
> +		spin_unlock(pml);
> +
> +		mm_dec_nr_ptes(mm);
> +		page_table_check_pte_clear_range(mm, addr, pgt_pmd);
> +		pte_free_defer(mm, pmd_pgtable(pgt_pmd));
>  	}
> -	i_mmap_unlock_write(mapping);
> -	return target_result;
> +	i_mmap_unlock_read(mapping);
>  }
>
>  /**
> @@ -2261,9 +2210,11 @@ static int collapse_file(struct mm_struct *mm, unsigned long addr,
>
>  	/*
>  	 * Remove pte page tables, so we can re-fault the page as huge.
> +	 * If MADV_COLLAPSE, adjust result to call collapse_pte_mapped_thp().
>  	 */
> -	result = retract_page_tables(mapping, start, mm, addr, hpage,
> -				     cc);
> +	retract_page_tables(mapping, start);
> +	if (cc && !cc->is_khugepaged)
> +		result = SCAN_PTE_MAPPED_HUGEPAGE;
>  	unlock_page(hpage);
>
>  	/*
> --
> 2.35.3
>
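
To make the MMU notifier point concrete: a rough, untested sketch of
what bracketing the new flush with an invalidation range might look
like. This is my illustration, not part of Hugh's patch, and the exact
range setup is an assumption; it only shows the shape of the fix
(notifier start/end must surround the PMD clear, and the sleepable
invalidate_range_start has to happen before the spinlocks are taken):

	struct mmu_notifier_range range;

	/* Notify secondary MMUs (KVM, IOMMUv2, ...) before clearing */
	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
				addr, addr + HPAGE_PMD_SIZE);
	mmu_notifier_invalidate_range_start(&range);

	pml = pmd_lock(mm, pmd);
	ptl = pte_lockptr(mm, pmd);
	if (ptl != pml)
		spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
	pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
	if (ptl != pml)
		spin_unlock(ptl);
	spin_unlock(pml);

	mmu_notifier_invalidate_range_end(&range);

collapse_and_free_pmd() did its pmdp_collapse_flush() inside such a
notifier range, which is the behavior the rewritten path loses.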