From: Ryan Roberts <ryan.roberts@arm.com>
To: Nico Pache, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Wed, 19 Feb 2025 15:39:15 +0000
Message-ID: <8524c7c7-024f-4f17-9b89-ef9aedfca672@arm.com>
In-Reply-To: <20250211003028.213461-6-npache@redhat.com>
References: <20250211003028.213461-1-npache@redhat.com> <20250211003028.213461-6-npache@redhat.com>
Subject: Re: [RFC v2 5/9] khugepaged: generalize __collapse_huge_page_* for mTHP support
On 11/02/2025 00:30, Nico Pache wrote:
> generalize the order of the __collapse_huge_page_* functions
> to support future mTHP collapse.
>
> mTHP collapse can suffer from inconsistent behavior, and memory waste
> "creep". Disable swapin and shared support for mTHP collapse.
>
> No functional changes in this patch.
>
> Signed-off-by: Nico Pache
> ---
>  mm/khugepaged.c | 48 ++++++++++++++++++++++++++++--------------------
>  1 file changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 0cfcdc11cabd..3776055bd477 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  					unsigned long address,
>  					pte_t *pte,
>  					struct collapse_control *cc,
> -					struct list_head *compound_pagelist)
> +					struct list_head *compound_pagelist,
> +					u8 order)

nit: I think we are mostly standardised on order being int. Is there any
reason to make it u8 here?

>  {
>  	struct page *page = NULL;
>  	struct folio *folio = NULL;
>  	pte_t *_pte;
>  	int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>  	bool writable = false;
> +	int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> +	for (_pte = pte; _pte < pte + (1 << order);
>  	     _pte++, address += PAGE_SIZE) {
>  		pte_t pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || (pte_present(pteval) &&
> @@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  			++none_or_zero;
>  			if (!userfaultfd_armed(vma) &&
>  			    (!cc->is_khugepaged ||
> -			     none_or_zero <= khugepaged_max_ptes_none)) {
> +			     none_or_zero <= scaled_none)) {
>  				continue;
>  			} else {
>  				result = SCAN_EXCEED_NONE_PTE;
> @@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>  		/* See khugepaged_scan_pmd(). */
>  		if (folio_likely_mapped_shared(folio)) {
>  			++shared;
> -			if (cc->is_khugepaged &&
> -			    shared > khugepaged_max_ptes_shared) {
> +			if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
> +			    shared > khugepaged_max_ptes_shared)) {
>  				result = SCAN_EXCEED_SHARED_PTE;
>  				count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);

Same comment about events; I think you will want to be careful to only
count events for PMD-sized THP using count_vm_event() and introduce
equivalent MTHP events to cover all sizes.

>  				goto out;
> @@ -711,14 +713,15 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>  						struct vm_area_struct *vma,
>  						unsigned long address,
>  						spinlock_t *ptl,
> -						struct list_head *compound_pagelist)
> +						struct list_head *compound_pagelist,
> +						u8 order)
>  {
>  	struct folio *src, *tmp;
>  	pte_t *_pte;
>  	pte_t pteval;
>
> -	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
> -	     _pte++, address += PAGE_SIZE) {
> +	for (_pte = pte; _pte < pte + (1 << order);
> +		_pte++, address += PAGE_SIZE) {

nit: you changed the indentation here.

>  		pteval = ptep_get(_pte);
>  		if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>  			add_mm_counter(vma->vm_mm, MM_ANONPAGES, 1);
> @@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  					     pmd_t *pmd,
>  					     pmd_t orig_pmd,
>  					     struct vm_area_struct *vma,
> -					     struct list_head *compound_pagelist)
> +					     struct list_head *compound_pagelist,
> +					     u8 order)
>  {
>  	spinlock_t *pmd_ptl;
>
> @@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  	 * Release both raw and compound pages isolated
>  	 * in __collapse_huge_page_isolate.
>  	 */
> -	release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
> +	release_pte_pages(pte, pte + (1 << order), compound_pagelist);
>  }
>
>  /*
> @@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>  static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  		pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>  		unsigned long address, spinlock_t *ptl,
> -		struct list_head *compound_pagelist)
> +		struct list_head *compound_pagelist, u8 order)
>  {
>  	unsigned int i;
>  	int result = SCAN_SUCCEED;
> @@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>  	/*
>  	 * Copying pages' contents is subject to memory poison at any iteration.
>  	 */
> -	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +	for (i = 0; i < (1 << order); i++) {
>  		pte_t pteval = ptep_get(pte + i);
>  		struct page *page = folio_page(folio, i);
>  		unsigned long src_addr = address + i * PAGE_SIZE;
> @@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>
>  	if (likely(result == SCAN_SUCCEED))
>  		__collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
> -						    compound_pagelist);
> +						    compound_pagelist, order);
>  	else
>  		__collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
> -						 compound_pagelist);
> +						 compound_pagelist, order);
>
>  	return result;
>  }
> @@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>  static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  				       struct vm_area_struct *vma,
>  				       unsigned long haddr, pmd_t *pmd,
> -				       int referenced)
> +				       int referenced, u8 order)
>  {
>  	int swapped_in = 0;
>  	vm_fault_t ret = 0;
> -	unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
> +	unsigned long address, end = haddr + (PAGE_SIZE << order);
>  	int result;
>  	pte_t *pte = NULL;
>  	spinlock_t *ptl;
> @@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  		if (!is_swap_pte(vmf.orig_pte))
>  			continue;
>
> +		if (order != HPAGE_PMD_ORDER) {
> +			result = SCAN_EXCEED_SWAP_PTE;
> +			goto out;
> +		}

A comment to explain the rationale for this divergent behaviour based on
order would be helpful.

> +
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
>  		ret = do_swap_page(&vmf);
> @@ -1114,7 +1123,6 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	int result = SCAN_FAIL;
>  	struct vm_area_struct *vma;
>  	struct mmu_notifier_range range;
> -

nit: no need for this whitespace change?

>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
>  	/*
> @@ -1149,7 +1157,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * that case. Continuing to collapse causes inconsistency.
>  	 */
>  	result = __collapse_huge_page_swapin(mm, vma, address, pmd,
> -					     referenced);
> +					     referenced, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
>  		goto out_nolock;
>  	}
> @@ -1196,7 +1204,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	pte = pte_offset_map_lock(mm, &_pmd, address, &pte_ptl);
>  	if (pte) {
>  		result = __collapse_huge_page_isolate(vma, address, pte, cc,
> -						      &compound_pagelist);
> +						      &compound_pagelist, HPAGE_PMD_ORDER);
>  		spin_unlock(pte_ptl);
>  	} else {
>  		result = SCAN_PMD_NULL;
> @@ -1226,7 +1234,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>
>  	result = __collapse_huge_page_copy(pte, folio, pmd, _pmd,
>  					   vma, address, pte_ptl,
> -					   &compound_pagelist);
> +					   &compound_pagelist, HPAGE_PMD_ORDER);
>  	pte_unmap(pte);
>  	if (unlikely(result != SCAN_SUCCEED))
>  		goto out_up_write;