From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <097741c8-b2ea-4869-8792-f9663019137c@arm.com>
Date: Tue, 17 Dec 2024 19:24:17 +0000
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse
Content-Language: en-GB
To: Dev Jain, akpm@linux-foundation.org, david@redhat.com,
 willy@infradead.org, kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org,
 vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
 dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz,
 srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com,
 aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com,
 ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com,
 jglisse@google.com, surenb@google.com, vishal.moola@gmail.com,
 zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com,
 21cnbao@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241216165105.56185-1-dev.jain@arm.com>
 <20241216165105.56185-9-dev.jain@arm.com>
From: Ryan Roberts
In-Reply-To: <20241216165105.56185-9-dev.jain@arm.com>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit

On 16/12/2024 16:51, Dev Jain wrote:
> Abstract away taking the mmap_lock exclusively, copying page contents, and
> setting the PMD, into vma_collapse_anon_folio_pmd().
>
> Signed-off-by: Dev Jain
> ---
>  mm/khugepaged.c | 119 +++++++++++++++++++++++++++---------------------
>  1 file changed, 66 insertions(+), 53 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 078794aa3335..88beebef773e 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1111,58 +1111,17 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>  	return SCAN_SUCCEED;
>  }
>
> -static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> -			      int referenced, int unmapped, int order,
> -			      struct collapse_control *cc)
> +static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
> +		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
> +		struct folio *folio)
>  {
> +	struct mmu_notifier_range range;
> +	spinlock_t *pmd_ptl, *pte_ptl;
>  	LIST_HEAD(compound_pagelist);
> -	pmd_t *pmd, _pmd;
> -	pte_t *pte;
>  	pgtable_t pgtable;
> -	struct folio *folio;
> -	spinlock_t *pmd_ptl, *pte_ptl;
> -	int result = SCAN_FAIL;
> -	struct vm_area_struct *vma;
> -	struct mmu_notifier_range range;
> -
> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
> -
> -	/*
> -	 * Before allocating the hugepage, release the mmap_lock read lock.
> -	 * The allocation can take potentially a long time if it involves
> -	 * sync compaction, and we do not need to hold the mmap_lock during
> -	 * that. We will recheck the vma after taking it again in write mode.
> -	 */
> -	mmap_read_unlock(mm);
> -
> -	result = alloc_charge_folio(&folio, mm, order, cc);
> -	if (result != SCAN_SUCCEED)
> -		goto out_nolock;
> -
> -	mmap_read_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
> -	if (result != SCAN_SUCCEED) {
> -		mmap_read_unlock(mm);
> -		goto out_nolock;
> -	}
> -
> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> -	if (result != SCAN_SUCCEED) {
> -		mmap_read_unlock(mm);
> -		goto out_nolock;
> -	}
> -
> -	if (unmapped) {
> -		/*
> -		 * __collapse_huge_page_swapin will return with mmap_lock
> -		 * released when it fails. So we jump out_nolock directly in
> -		 * that case. Continuing to collapse causes inconsistency.
> -		 */
> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -						     referenced, order);
> -		if (result != SCAN_SUCCEED)
> -			goto out_nolock;
> -	}
> +	int result;
> +	pmd_t _pmd;
> +	pte_t *pte;
>
>  	mmap_read_unlock(mm);
>  	/*
> @@ -1174,7 +1133,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * mmap_lock.
>  	 */
>  	mmap_write_lock(mm);
> -	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
> +
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
>  	if (result != SCAN_SUCCEED)
>  		goto out_up_write;
>  	/* check if the pmd is still valid */
> @@ -1206,7 +1166,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      &compound_pagelist, order);
> +						      &compound_pagelist, HPAGE_PMD_ORDER);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
> @@ -1262,11 +1222,64 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	deferred_split_folio(folio, false);
>  	spin_unlock(pmd_ptl);
>
> -	folio = NULL;
> -
>  	result = SCAN_SUCCEED;
>  out_up_write:
>  	mmap_write_unlock(mm);
> +	return result;
> +}
> +
> +static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> +			      int referenced, int unmapped, int order,
> +			      struct collapse_control *cc)
> +{
> +	struct vm_area_struct *vma;
> +	int result = SCAN_FAIL;
> +	struct folio *folio;
> +	pmd_t *pmd;
> +
> +	/*
> +	 * Before allocating the hugepage, release the mmap_lock read lock.
> +	 * The allocation can take potentially a long time if it involves
> +	 * sync compaction, and we do not need to hold the mmap_lock during
> +	 * that. We will recheck the vma after taking it again in write mode.
> +	 */
> +	mmap_read_unlock(mm);
> +
> +	result = alloc_charge_folio(&folio, mm, order, cc);
> +	if (result != SCAN_SUCCEED)
> +		goto out_nolock;
> +
> +	mmap_read_lock(mm);
> +	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
> +	if (result != SCAN_SUCCEED) {
> +		mmap_read_unlock(mm);
> +		goto out_nolock;
> +	}
> +
> +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
> +	if (result != SCAN_SUCCEED) {
> +		mmap_read_unlock(mm);
> +		goto out_nolock;
> +	}
> +
> +	if (unmapped) {
> +		/*
> +		 * __collapse_huge_page_swapin will return with mmap_lock
> +		 * released when it fails. So we jump out_nolock directly in
> +		 * that case. Continuing to collapse causes inconsistency.
> +		 */
> +		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> +						     referenced, order);
> +		if (result != SCAN_SUCCEED)
> +			goto out_nolock;
> +	}
> +
> +	if (order == HPAGE_PMD_ORDER)
> +		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);

I think the locking is broken here? collapse_huge_page() used to be entered
with the mmap read lock held and to exit without the lock held at all. After
this change, that is only true for order == HPAGE_PMD_ORDER; for any other
order you exit with the mmap read lock still held. Perhaps:

	else
		mmap_read_unlock(mm);

(see the sketch after the quoted hunk below)

> +
> +	if (result == SCAN_SUCCEED)
> +		folio = NULL;
> +
>  out_nolock:
>  	if (folio)
>  		folio_put(folio);
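
For illustration, a minimal sketch of how the tail of collapse_huge_page()
might look with that else branch added; everything except the else is lifted
from the hunk above, and the rest of the out_nolock path is assumed unchanged:

	if (order == HPAGE_PMD_ORDER)
		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
	else
		/* preserve "exit with no mmap lock held" for non-PMD orders */
		mmap_read_unlock(mm);

	if (result == SCAN_SUCCEED)
		folio = NULL;

out_nolock:
	if (folio)
		folio_put(folio);
	/* rest of the out_nolock path unchanged */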