From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <999a569d-9424-41b0-b0c8-d028aee2985f@arm.com>
Date: Wed, 18 Dec 2024 14:56:35 +0530
Subject: Re: [RFC PATCH 08/12] khugepaged: Abstract PMD-THP collapse
From: Dev Jain <dev.jain@arm.com>
To: Ryan Roberts, akpm@linux-foundation.org, david@redhat.com, willy@infradead.org, kirill.shutemov@linux.intel.com
Cc: anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <20241216165105.56185-1-dev.jain@arm.com> <20241216165105.56185-9-dev.jain@arm.com> <097741c8-b2ea-4869-8792-f9663019137c@arm.com>
In-Reply-To: <097741c8-b2ea-4869-8792-f9663019137c@arm.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
User-Agent: Mozilla Thunderbird
List-ID: <linux-mm.kvack.org>

On
18/12/24 12:54 am, Ryan Roberts wrote:
> On 16/12/2024 16:51, Dev Jain wrote:
>> Abstract away taking the mmap_lock exclusively, copying page contents, and
>> setting the PMD, into vma_collapse_anon_folio_pmd().
>>
>> Signed-off-by: Dev Jain
>> ---
>>  mm/khugepaged.c | 119 +++++++++++++++++++++++++++---------------------
>>  1 file changed, 66 insertions(+), 53 deletions(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 078794aa3335..88beebef773e 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1111,58 +1111,17 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>>  	return SCAN_SUCCEED;
>>  }
>>
>> -static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> -			      int referenced, int unmapped, int order,
>> -			      struct collapse_control *cc)
>> +static int vma_collapse_anon_folio_pmd(struct mm_struct *mm, unsigned long address,
>> +		struct vm_area_struct *vma, struct collapse_control *cc, pmd_t *pmd,
>> +		struct folio *folio)
>>  {
>> +	struct mmu_notifier_range range;
>> +	spinlock_t *pmd_ptl, *pte_ptl;
>>  	LIST_HEAD(compound_pagelist);
>> -	pmd_t *pmd, _pmd;
>> -	pte_t *pte;
>>  	pgtable_t pgtable;
>> -	struct folio *folio;
>> -	spinlock_t *pmd_ptl, *pte_ptl;
>> -	int result = SCAN_FAIL;
>> -	struct vm_area_struct *vma;
>> -	struct mmu_notifier_range range;
>> -
>> -	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>> -
>> -	/*
>> -	 * Before allocating the hugepage, release the mmap_lock read lock.
>> -	 * The allocation can take potentially a long time if it involves
>> -	 * sync compaction, and we do not need to hold the mmap_lock during
>> -	 * that. We will recheck the vma after taking it again in write mode.
>> -	 */
>> -	mmap_read_unlock(mm);
>> -
>> -	result = alloc_charge_folio(&folio, mm, order, cc);
>> -	if (result != SCAN_SUCCEED)
>> -		goto out_nolock;
>> -
>> -	mmap_read_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
>> -	if (result != SCAN_SUCCEED) {
>> -		mmap_read_unlock(mm);
>> -		goto out_nolock;
>> -	}
>> -
>> -	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> -	if (result != SCAN_SUCCEED) {
>> -		mmap_read_unlock(mm);
>> -		goto out_nolock;
>> -	}
>> -
>> -	if (unmapped) {
>> -		/*
>> -		 * __collapse_huge_page_swapin will return with mmap_lock
>> -		 * released when it fails. So we jump out_nolock directly in
>> -		 * that case. Continuing to collapse causes inconsistency.
>> -		 */
>> -		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> -						     referenced, order);
>> -		if (result != SCAN_SUCCEED)
>> -			goto out_nolock;
>> -	}
>> +	int result;
>> +	pmd_t _pmd;
>> +	pte_t *pte;
>>
>>  	mmap_read_unlock(mm);
>>  	/*
>> @@ -1174,7 +1133,8 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>  	 * mmap_lock.
>>  	 */
>>  	mmap_write_lock(mm);
>> -	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
>> +
>> +	result = hugepage_vma_revalidate(mm, address, true, &vma, HPAGE_PMD_ORDER, cc);
>>  	if (result != SCAN_SUCCEED)
>>  		goto out_up_write;
>>  	/* check if the pmd is still valid */
>> @@ -1206,7 +1166,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>>  	if (pte) {
>>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
>> -						      &compound_pagelist, order);
>> +						      &compound_pagelist, HPAGE_PMD_ORDER);
>>  		spin_unlock(pte_ptl);
>>  	} else {
>>  		result = SCAN_PMD_NULL;
>> @@ -1262,11 +1222,64 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>>  	deferred_split_folio(folio, false);
>>  	spin_unlock(pmd_ptl);
>>
>> -	folio = NULL;
>> -
>>  	result = SCAN_SUCCEED;
>> out_up_write:
>>  	mmap_write_unlock(mm);
>> +	return result;
>> +}
>> +
>> +static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>> +			      int referenced, int unmapped, int order,
>> +			      struct collapse_control *cc)
>> +{
>> +	struct vm_area_struct *vma;
>> +	int result = SCAN_FAIL;
>> +	struct folio *folio;
>> +	pmd_t *pmd;
>> +
>> +	/*
>> +	 * Before allocating the hugepage, release the mmap_lock read lock.
>> +	 * The allocation can take potentially a long time if it involves
>> +	 * sync compaction, and we do not need to hold the mmap_lock during
>> +	 * that. We will recheck the vma after taking it again in write mode.
>> +	 */
>> +	mmap_read_unlock(mm);
>> +
>> +	result = alloc_charge_folio(&folio, mm, order, cc);
>> +	if (result != SCAN_SUCCEED)
>> +		goto out_nolock;
>> +
>> +	mmap_read_lock(mm);
>> +	result = hugepage_vma_revalidate(mm, address, true, &vma, order, cc);
>> +	if (result != SCAN_SUCCEED) {
>> +		mmap_read_unlock(mm);
>> +		goto out_nolock;
>> +	}
>> +
>> +	result = find_pmd_or_thp_or_none(mm, address, &pmd);
>> +	if (result != SCAN_SUCCEED) {
>> +		mmap_read_unlock(mm);
>> +		goto out_nolock;
>> +	}
>> +
>> +	if (unmapped) {
>> +		/*
>> +		 * __collapse_huge_page_swapin will return with mmap_lock
>> +		 * released when it fails. So we jump out_nolock directly in
>> +		 * that case. Continuing to collapse causes inconsistency.
>> +		 */
>> +		result = __collapse_huge_page_swapin(mm, vma, address, pmd,
>> +						     referenced, order);
>> +		if (result != SCAN_SUCCEED)
>> +			goto out_nolock;
>> +	}
>> +
>> +	if (order == HPAGE_PMD_ORDER)
>> +		result = vma_collapse_anon_folio_pmd(mm, address, vma, cc, pmd, folio);
>
> I think the locking is broken here? collapse_huge_page() used to enter with the
> mmap read lock and exit without the lock held at all. After the change, this is
> only true for order == HPAGE_PMD_ORDER. For other orders, you exit with the mmap
> read lock still held. Perhaps:
>
> 	else
> 		mmap_read_unlock(mm);

I have completely messed up sequential building of patches :) I will take
care next time.

>
>> +
>> +	if (result == SCAN_SUCCEED)
>> +		folio = NULL;
>> +
>>  out_nolock:
>>  	if (folio)
>>  		folio_put(folio);