From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <5de38fe3-4a73-443b-87d1-0c836ffdbe30@linux.alibaba.com>
Date: Sun, 27 Apr 2025 10:51:42 +0800
Subject: Re: [PATCH v4 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
From: Baolin Wang <baolin.wang@linux.alibaba.com>
To: Nico Pache <npache@redhat.com>, linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, willy@infradead.org, peterx@redhat.com, ziy@nvidia.com, wangkefeng.wang@huawei.com, usamaarif642@gmail.com, sunnanyong@huawei.com, vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com, tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz, cl@gentwo.org, jglisse@google.com, surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org
References: <20250417000238.74567-1-npache@redhat.com> <20250417000238.74567-7-npache@redhat.com>
In-Reply-To: <20250417000238.74567-7-npache@redhat.com>

On 2025/4/17 08:02, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support, we use this scan instead to record chunks of utilized
> sections of the PMD.
>
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of utilized regions. We can then determine which
> mTHP size fits best, and in the following patch we set this bitmap while
> scanning the PMD.
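
My understanding of the walk being described: instead of true
recursion, the scan pushes halves of the PMD range onto a small
explicit stack and consumes them depth-first, trying the largest
region first. A minimal userspace sketch of that walk, just to confirm
my reading (the names, the byte-per-bit bitmap and the plain "all bits
set" check are my own simplifications; the patch itself scales the
check by max_ptes_none):

#include <stdio.h>

#define MIN_ORDER	2	/* one bit covers 1 << 2 PTEs, as in the patch */
#define PMD_ORDER	9	/* 512 PTEs per PMD with 4K pages */
#define NBITS		(1 << (PMD_ORDER - MIN_ORDER))	/* 128 chunks */

struct state { int order; int offset; };	/* order relative to MIN_ORDER */

static int weight(const unsigned char *bits, int off, int n)
{
	int set = 0;

	for (int i = off; i < off + n; i++)
		set += bits[i];		/* byte-per-bit keeps the sketch simple */
	return set;
}

static void scan(const unsigned char *bits)
{
	struct state stack[2 * NBITS];
	int top = -1;

	stack[++top] = (struct state){ PMD_ORDER - MIN_ORDER, 0 };
	while (top >= 0) {
		struct state s = stack[top--];
		int chunks = 1 << s.order;

		if (weight(bits, s.offset, chunks) == chunks) {
			printf("collapse order %d at chunk %d\n",
			       s.order + MIN_ORDER, s.offset);
			continue;	/* region consumed, don't split it */
		}
		if (s.order > 0) {	/* split in half, left half popped first */
			stack[++top] = (struct state){ s.order - 1, s.offset + chunks / 2 };
			stack[++top] = (struct state){ s.order - 1, s.offset };
		}
	}
}

int main(void)
{
	unsigned char bits[NBITS] = { 0 };

	for (int i = 0; i < 64; i++)	/* first half of the PMD utilized */
		bits[i] = 1;
	scan(bits);	/* prints: collapse order 8 at chunk 0 */
	return 0;
}

With the first half of the range populated, this collapses a single
order-8 region at chunk 0 and leaves the empty half alone, which is
what I would expect from the hunks below.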
>
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
>
> When attempting to collapse an order that is set to "always", we
> always collapse to that order in a greedy manner, without considering
> the number of bits set.
>
> Signed-off-by: Nico Pache <npache@redhat.com>
> ---
>  include/linux/khugepaged.h |  4 ++
>  mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
>  2 files changed, 89 insertions(+), 9 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..18fe6eb5051d 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,6 +1,10 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
> +#define KHUGEPAGED_MIN_MTHP_ORDER	2

Why is the minimum mTHP order set to 2? IMO, file large folios can
support order 1, so shouldn't we expect to collapse exec file small
folios to order 1 where possible?

(PS: I need more time to understand your logic in this patch, and any
additional explanation would be helpful :) )

> +#define KHUGEPAGED_MIN_MTHP_NR	(1 << KHUGEPAGED_MIN_MTHP_ORDER)
> +#define MAX_MTHP_BITMAP_SIZE	(1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> +#define MTHP_BITMAP_SIZE	(1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
>
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 5e9272ab82da..83230e9cdf3a 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +struct scan_bit_state {
> +	u8 order;
> +	u16 offset;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>
> @@ -102,6 +107,18 @@ struct collapse_control {
>
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/*
> +	 * bitmap used to collapse mTHP sizes.
> +	 * 1 bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> +	 */
> +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>  };
>
>  /**
> @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
>  	remove_wait_queue(&khugepaged_wait, &wait);
>  }
>
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
> @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>
>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +			      u8 order, u16 offset)
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
> +	 * If collapsing mTHPs we may have already released the read_lock.
>  	 */
> -	mmap_read_unlock(mm);
> +	if (*mmap_locked) {
> +		mmap_read_unlock(mm);
> +		*mmap_locked = false;
> +	}
>
>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
> @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  out_up_write:
>  	mmap_write_unlock(mm);
>  out_nolock:
> +	*mmap_locked = false;
>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>  	return result;
>  }
>
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, next_order;
> +	u16 offset, mid_offset;
> +	int num_chunks;
> +	int bits_set, threshold_bits;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +	bool is_pmd_only = (enabled_orders == (1 << HPAGE_PMD_ORDER));
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order + KHUGEPAGED_MIN_MTHP_ORDER;
> +		offset = state.offset;
> +		num_chunks = 1 << (state.order);
> +		// Skip mTHP orders that are not enabled
> +		if (!test_bit(order, &enabled_orders))
> +			goto next;
> +
> +		// copy the relevant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				   MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> +				 >> (HPAGE_PMD_ORDER - state.order);
> +
> +		// Check if the region is "almost full" based on the threshold
> +		if (bits_set > threshold_bits || is_pmd_only
> +		    || test_bit(order, &huge_anon_orders_always)) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +					mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED) {
> +				collapsed += (1 << order);
> +				continue;
> +			}
> +		}
> +
> +next:
> +		if (state.order > 0) {
> +			next_order = state.order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			       struct vm_area_struct *vma,
>  			       unsigned long address, bool *mmap_locked,
> @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  		pte_unmap_unlock(pte, ptl);
>  		if (result == SCAN_SUCCEED) {
>  			result = collapse_huge_page(mm, address, referenced,
> -						    unmapped, cc);
> -			/* collapse_huge_page will return with the mmap_lock released */
> -			*mmap_locked = false;
> +						    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>  		}
>  out:
>  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
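
To double-check the threshold scaling above: with 4K pages
(HPAGE_PMD_ORDER = 9, HPAGE_PMD_NR = 512), a quick userspace
calculation of threshold_bits per order (my arithmetic only, not
kernel code):

#include <stdio.h>

#define HPAGE_PMD_ORDER	9			/* assumes 4K pages */
#define HPAGE_PMD_NR	(1 << HPAGE_PMD_ORDER)	/* 512 PTEs per PMD */
#define MIN_MTHP_ORDER	2			/* KHUGEPAGED_MIN_MTHP_ORDER */

int main(void)
{
	/* two sample settings of khugepaged/max_ptes_none */
	int samples[] = { 511, 0 };

	for (int i = 0; i < 2; i++) {
		int max_ptes_none = samples[i];

		printf("max_ptes_none = %d\n", max_ptes_none);
		for (int order = HPAGE_PMD_ORDER; order >= MIN_MTHP_ORDER; order--) {
			int state_order = order - MIN_MTHP_ORDER;
			int num_chunks = 1 << state_order;
			/* same expression as in khugepaged_scan_bitmap() */
			int threshold = (HPAGE_PMD_NR - max_ptes_none - 1)
					>> (HPAGE_PMD_ORDER - state_order);

			/* collapse requires bits_set > threshold */
			printf("  order %d: bits_set > %d of %d\n",
			       order, threshold, num_chunks);
		}
	}
	return 0;
}

If I compute this correctly, the default max_ptes_none = 511 makes the
threshold 0 at every order, so a single set bit is enough to attempt a
collapse at any enabled order, while max_ptes_none = 0 requires the
region to be fully utilized. It may be worth spelling that behaviour
out in the changelog.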