Message-ID: 
Date: Wed, 23 Apr 2025 16:25:06 +0800
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
Subject: Re: [PATCH v4 05/12] khugepaged: generalize __collapse_huge_page_* for mTHP support
To: Nico Pache
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org, akpm@linux-foundation.org, corbet@lwn.net,
 rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com,
 david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, willy@infradead.org,
 peterx@redhat.com, ziy@nvidia.com, wangkefeng.wang@huawei.com,
 usamaarif642@gmail.com, sunnanyong@huawei.com, vishal.moola@gmail.com,
 thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com,
 kirill.shutemov@linux.intel.com, aarcange@redhat.com, raquini@redhat.com,
 dev.jain@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com,
 tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz,
 cl@gentwo.org, jglisse@google.com, surenb@google.com, zokeefe@google.com,
 hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org
References: <20250417000238.74567-1-npache@redhat.com>
 <20250417000238.74567-6-npache@redhat.com>
From: Baolin Wang
In-Reply-To: 
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit

On 2025/4/23 16:00, Nico Pache wrote:
> On Wed, Apr 23, 2025 at 1:30 AM Baolin Wang wrote:
>>
>>
>> On 2025/4/17 08:02, Nico Pache wrote:
>>> generalize the order of the __collapse_huge_page_* functions
>>> to support future mTHP collapse.
>>>
>>> mTHP collapse can suffer from inconsistent behavior and memory waste
>>> "creep". Disable swapin and shared support for mTHP collapse.
>>>
>>> No functional changes in this patch.
>>>
>>> Co-developed-by: Dev Jain
>>> Signed-off-by: Dev Jain
>>> Signed-off-by: Nico Pache
>>> ---
>>>   mm/khugepaged.c | 46 ++++++++++++++++++++++++++++------------------
>>>   1 file changed, 28 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>>> index 883e9a46359f..5e9272ab82da 100644
>>> --- a/mm/khugepaged.c
>>> +++ b/mm/khugepaged.c
>>> @@ -565,15 +565,17 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                                          unsigned long address,
>>>                                          pte_t *pte,
>>>                                          struct collapse_control *cc,
>>> -                                        struct list_head *compound_pagelist)
>>> +                                        struct list_head *compound_pagelist,
>>> +                                        u8 order)
>>>  {
>>>      struct page *page = NULL;
>>>      struct folio *folio = NULL;
>>>      pte_t *_pte;
>>>      int none_or_zero = 0, shared = 0, result = SCAN_FAIL, referenced = 0;
>>>      bool writable = false;
>>> +    int scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
>>>
>>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> +    for (_pte = pte; _pte < pte + (1 << order);
>>>           _pte++, address += PAGE_SIZE) {
>>>          pte_t pteval = ptep_get(_pte);
>>>          if (pte_none(pteval) || (pte_present(pteval) &&
>>> @@ -581,7 +583,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                  ++none_or_zero;
>>>                  if (!userfaultfd_armed(vma) &&
>>>                      (!cc->is_khugepaged ||
>>> -                     none_or_zero <= khugepaged_max_ptes_none)) {
>>> +                     none_or_zero <= scaled_none)) {
>>>                          continue;
>>>                  } else {
>>>                          result = SCAN_EXCEED_NONE_PTE;
>>> @@ -609,8 +611,8 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
>>>                  /* See hpage_collapse_scan_pmd(). */
>>>                  if (folio_maybe_mapped_shared(folio)) {
>>>                          ++shared;
>>> -                        if (cc->is_khugepaged &&
>>> -                            shared > khugepaged_max_ptes_shared) {
>>> +                        if (order != HPAGE_PMD_ORDER || (cc->is_khugepaged &&
>>> +                            shared > khugepaged_max_ptes_shared)) {
>>>                                  result = SCAN_EXCEED_SHARED_PTE;
>>>                                  count_vm_event(THP_SCAN_EXCEED_SHARED_PTE);
>>>                                  goto out;
>>> @@ -711,13 +713,14 @@ static void __collapse_huge_page_copy_succeeded(pte_t *pte,
>>>                                                  struct vm_area_struct *vma,
>>>                                                  unsigned long address,
>>>                                                  spinlock_t *ptl,
>>> -                                                struct list_head *compound_pagelist)
>>> +                                                struct list_head *compound_pagelist,
>>> +                                                u8 order)
>>>  {
>>>      struct folio *src, *tmp;
>>>      pte_t *_pte;
>>>      pte_t pteval;
>>>
>>> -    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
>>> +    for (_pte = pte; _pte < pte + (1 << order);
>>>           _pte++, address += PAGE_SIZE) {
>>>          pteval = ptep_get(_pte);
>>>          if (pte_none(pteval) || is_zero_pfn(pte_pfn(pteval))) {
>>> @@ -764,7 +767,8 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>>                                          pmd_t *pmd,
>>>                                          pmd_t orig_pmd,
>>>                                          struct vm_area_struct *vma,
>>> -                                        struct list_head *compound_pagelist)
>>> +                                        struct list_head *compound_pagelist,
>>> +                                        u8 order)
>>>  {
>>>      spinlock_t *pmd_ptl;
>>>
>>> @@ -781,7 +785,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>>       * Release both raw and compound pages isolated
>>>       * in __collapse_huge_page_isolate.
>>>       */
>>> -    release_pte_pages(pte, pte + HPAGE_PMD_NR, compound_pagelist);
>>> +    release_pte_pages(pte, pte + (1 << order), compound_pagelist);
>>>  }
>>>
>>>  /*
>>> @@ -802,7 +806,7 @@ static void __collapse_huge_page_copy_failed(pte_t *pte,
>>>  static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>>              pmd_t *pmd, pmd_t orig_pmd, struct vm_area_struct *vma,
>>>              unsigned long address, spinlock_t *ptl,
>>> -            struct list_head *compound_pagelist)
>>> +            struct list_head *compound_pagelist, u8 order)
>>>  {
>>>      unsigned int i;
>>>      int result = SCAN_SUCCEED;
>>> @@ -810,7 +814,7 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>>      /*
>>>       * Copying pages' contents is subject to memory poison at any iteration.
>>>       */
>>> -    for (i = 0; i < HPAGE_PMD_NR; i++) {
>>> +    for (i = 0; i < (1 << order); i++) {
>>>          pte_t pteval = ptep_get(pte + i);
>>>          struct page *page = folio_page(folio, i);
>>>          unsigned long src_addr = address + i * PAGE_SIZE;
>>> @@ -829,10 +833,10 @@ static int __collapse_huge_page_copy(pte_t *pte, struct folio *folio,
>>>
>>>      if (likely(result == SCAN_SUCCEED))
>>>          __collapse_huge_page_copy_succeeded(pte, vma, address, ptl,
>>> -                                            compound_pagelist);
>>> +                                            compound_pagelist, order);
>>>      else
>>>          __collapse_huge_page_copy_failed(pte, pmd, orig_pmd, vma,
>>> -                                         compound_pagelist);
>>> +                                         compound_pagelist, order);
>>>
>>>      return result;
>>>  }
>>> @@ -1000,11 +1004,11 @@ static int check_pmd_still_valid(struct mm_struct *mm,
>>>  static int __collapse_huge_page_swapin(struct mm_struct *mm,
>>>                                         struct vm_area_struct *vma,
>>>                                         unsigned long haddr, pmd_t *pmd,
>>> -                                       int referenced)
>>> +                                       int referenced, u8 order)
>>>  {
>>>      int swapped_in = 0;
>>>      vm_fault_t ret = 0;
>>> -    unsigned long address, end = haddr + (HPAGE_PMD_NR * PAGE_SIZE);
>>> +    unsigned long address, end = haddr + (PAGE_SIZE << order);
>>>      int result;
>>>      pte_t *pte = NULL;
>>>      spinlock_t *ptl;
>>> @@ -1035,6 +1039,12 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>>>          if (!is_swap_pte(vmf.orig_pte))
>>>              continue;
>>>
>>> +        /* Dont swapin for mTHP collapse */
>>> +        if (order != HPAGE_PMD_ORDER) {
>>> +            result = SCAN_EXCEED_SWAP_PTE;
>>> +            goto out;
>>> +        }
>>
>> IMO, this check should move into hpage_collapse_scan_pmd(). That means
>> if we scan swap ptes for mTHP collapse, we can return
>> 'SCAN_EXCEED_SWAP_PTE' to abort the collapse earlier.
>
> I don't think this is correct. We currently abort if the global
> max_swap_ptes or max_shared_ptes is exceeded during the PMD scan.
> However, if those checks pass (and we don't collapse at the PMD level),
> we continue on to mTHP collapses. Then, during the isolate function, we
> check for shared ptes in this specific mTHP range and abort if there is
> a shared pte. For swap we only know that some pages in the PMD are
> unmapped, but we aren't sure which, so we have to try to fault the
> PTEs, and if one is a swap pte and we are doing an mTHP collapse, we
> abort the collapse attempt. So having swap/shared PTEs in the PMD scan
> does not indicate that ALL mTHP collapses will fail, but some will.

Yes, you are right! I misread the code (I thought the changes were in
hpage_collapse_scan_pmd()). Sorry for the noise. Feel free to add:

Reviewed-by: Baolin Wang
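As a rough illustration of the scaled max_ptes_none cutoff introduced in
__collapse_huge_page_isolate() above, the sketch below mirrors the patch's
`scaled_none` formula in userspace, assuming 4K base pages (HPAGE_PMD_ORDER
== 9) and the default khugepaged_max_ptes_none of HPAGE_PMD_NR - 1 (511);
the macros here are local stand-ins, not the kernel's definitions:

```c
#include <stdio.h>

#define HPAGE_PMD_ORDER        9   /* assumption: 2 MB PMD on 4K pages */
#define MAX_PTES_NONE_DEFAULT  ((1 << HPAGE_PMD_ORDER) - 1)  /* 511 */

int main(void)
{
	/* Mirror of: scaled_none = khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order) */
	for (int order = 2; order <= HPAGE_PMD_ORDER; order++) {
		int nr_ptes = 1 << order;
		int scaled_none = MAX_PTES_NONE_DEFAULT >> (HPAGE_PMD_ORDER - order);

		printf("order %d: %4d PTEs scanned, at most %3d none/zero allowed\n",
		       order, nr_ptes, scaled_none);
	}
	return 0;
}
```

With these assumed defaults, a PMD-sized collapse (order 9) keeps the
unchanged cutoff of 511, while a 64 KB mTHP collapse (order 4) tolerates at
most 15 of its 16 PTEs being none/zero.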