From: Dev Jain <dev.jain@arm.com>
Date: Fri, 10 Jan 2025 14:35:47 +0530
Subject: Re: [RFC 08/11] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
To: Nico Pache, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Cc: ryan.roberts@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org
In-Reply-To: <20250108233128.14484-9-npache@redhat.com>
References: <20250108233128.14484-1-npache@redhat.com> <20250108233128.14484-9-npache@redhat.com>

On 09/01/25 5:01 am, Nico Pache wrote:
> khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> mTHP support we use this scan to instead record chunks of fully utilized
> sections of the PMD.
>
> Create a bitmap to represent a PMD in order MIN_MTHP_ORDER chunks. By
> default we set this to order 3. The reasoning is that for a 4K page size
> and a 512-entry PMD this results in a 64-bit bitmap, which has some
> optimizations. For other arches, like ARM64 with 64K pages, we can set a
> larger order if needed.
>
> khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> that represents chunks of fully utilized regions. We can then determine
> what mTHP size fits best, and in the following patch we set this bitmap
> while scanning the PMD.
>
> max_ptes_none is used as a scale to determine how "full" an order must
> be before being considered for collapse.
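
As a sanity check on the sizing claim (scratch math on my side, not from
the patch): one bit covers an order-MIN_MTHP_ORDER chunk, i.e. 8 PTEs.
With 4K pages HPAGE_PMD_ORDER is 9, so the bitmap is 1 << (9 - 3) = 64
bits; on an arm64 64K-page kernel HPAGE_PMD_ORDER is 13, giving 1024
bits, which is why the allocations below have to be dynamic. A plain
userspace sketch, with the PMD orders hard-coded by hand:

    /* Scratch check of MTHP_BITMAP_SIZE, assuming the patch's
     * MIN_MTHP_ORDER = 3. Userspace C, not kernel code.
     */
    #include <stdio.h>

    #define MIN_MTHP_ORDER 3

    static unsigned long bitmap_bits(int hpage_pmd_order)
    {
            return 1UL << (hpage_pmd_order - MIN_MTHP_ORDER);
    }

    int main(void)
    {
            printf("4K pages:  %lu bits\n", bitmap_bits(9));  /* 64   */
            printf("64K pages: %lu bits\n", bitmap_bits(13)); /* 1024 */
            return 0;
    }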
>
> Signed-off-by: Nico Pache
> ---
>  include/linux/khugepaged.h |   4 +-
>  mm/khugepaged.c            | 129 +++++++++++++++++++++++++++++++++++--
>  2 files changed, 126 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> index 1f46046080f5..31cff8aeec4a 100644
> --- a/include/linux/khugepaged.h
> +++ b/include/linux/khugepaged.h
> @@ -1,7 +1,9 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #ifndef _LINUX_KHUGEPAGED_H
>  #define _LINUX_KHUGEPAGED_H
> -

Nit: I don't think this line needs to be deleted.

> +#define MIN_MTHP_ORDER 3
> +#define MIN_MTHP_NR (1<<MIN_MTHP_ORDER)
> +#define MTHP_BITMAP_SIZE (1<<(HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>  extern unsigned int khugepaged_max_ptes_none __read_mostly;
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern struct attribute_group khugepaged_attr_group;
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 9eb161b04ee4..de1dc6ea3c71 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
>
>  static struct kmem_cache *mm_slot_cache __ro_after_init;
>
> +struct scan_bit_state {
> +	u8 order;
> +	u8 offset;
> +};
> +
>  struct collapse_control {
>  	bool is_khugepaged;
>
> @@ -102,6 +107,15 @@ struct collapse_control {
>
>  	/* nodemask for allocation fallback */
>  	nodemask_t alloc_nmask;
> +
> +	/* bitmap used to collapse mTHP sizes. 1bit = order MIN_MTHP_ORDER mTHP */
> +	unsigned long *mthp_bitmap;
> +	unsigned long *mthp_bitmap_temp;
> +	struct scan_bit_state *mthp_bitmap_stack;
> +};
> +
> +struct collapse_control khugepaged_collapse_control = {
> +	.is_khugepaged = true,
>  };
>
>  /**
> @@ -389,6 +403,25 @@ int __init khugepaged_init(void)
>  	if (!mm_slot_cache)
>  		return -ENOMEM;
>
> +	/*
> +	 * allocate the bitmaps dynamically since MTHP_BITMAP_SIZE is not known at
> +	 * compile time for some architectures.
> +	 */
> +	khugepaged_collapse_control.mthp_bitmap = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!khugepaged_collapse_control.mthp_bitmap)
> +		return -ENOMEM;
> +
> +	khugepaged_collapse_control.mthp_bitmap_temp = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!khugepaged_collapse_control.mthp_bitmap_temp)
> +		return -ENOMEM;
> +
> +	khugepaged_collapse_control.mthp_bitmap_stack = kmalloc_array(
> +		MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> +	if (!khugepaged_collapse_control.mthp_bitmap_stack)
> +		return -ENOMEM;
> +
>  	khugepaged_pages_to_scan = HPAGE_PMD_NR * 8;
>  	khugepaged_max_ptes_none = HPAGE_PMD_NR - 1;
>  	khugepaged_max_ptes_swap = HPAGE_PMD_NR / 8;
> @@ -400,6 +433,9 @@ int __init khugepaged_init(void)
>  void __init khugepaged_destroy(void)
>  {
>  	kmem_cache_destroy(mm_slot_cache);
> +	kfree(khugepaged_collapse_control.mthp_bitmap);
> +	kfree(khugepaged_collapse_control.mthp_bitmap_temp);
> +	kfree(khugepaged_collapse_control.mthp_bitmap_stack);
>  }
>
>  static inline int khugepaged_test_exit(struct mm_struct *mm)
> @@ -850,10 +886,6 @@ static void khugepaged_alloc_sleep(void)
>  	remove_wait_queue(&khugepaged_wait, &wait);
>  }
>
> -struct collapse_control khugepaged_collapse_control = {
> -	.is_khugepaged = true,
> -};
> -
>  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
>  {
>  	int i;
> @@ -1102,7 +1134,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
>
>  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  			      int referenced, int unmapped,
> -			      struct collapse_control *cc)
> +			      struct collapse_control *cc, bool *mmap_locked,
> +			      int order, int offset)
>  {
>  	LIST_HEAD(compound_pagelist);
>  	pmd_t *pmd, _pmd;
> @@ -1115,6 +1148,11 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  	struct mmu_notifier_range range;
>  	VM_BUG_ON(address & ~HPAGE_PMD_MASK);
>
> +	/* if collapsing mTHPs we may have already released the read_lock, and
> +	 * need to reacquire it to keep the proper locking order.
> +	 */
> +	if (!*mmap_locked)
> +		mmap_read_lock(mm);

There is no need to take the read lock again, because we drop it just
after this.
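
That is, the flow condenses to (my paraphrase of the code above and
below, not a tested patch):

    if (!*mmap_locked)
            mmap_read_lock(mm);     /* taken here ...              */
    ...
    mmap_read_unlock(mm);           /* ... and dropped immediately */
    *mmap_locked = false;

so the pair could be reduced to only dropping the lock when we actually
hold it, something like:

    if (*mmap_locked) {
            mmap_read_unlock(mm);
            *mmap_locked = false;
    }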
>  	/*
>  	 * Before allocating the hugepage, release the mmap_lock read lock.
>  	 * The allocation can take potentially a long time if it involves
>  	 * sync compaction, and we do not need to hold the mmap_lock during
>  	 * that. We will recheck the vma after taking it again in write mode.
>  	 */
>  	mmap_read_unlock(mm);
> +	*mmap_locked = false;
>
>  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
>  	if (result != SCAN_SUCCEED)
> @@ -1256,12 +1295,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
>  out_up_write:
>  	mmap_write_unlock(mm);
>  out_nolock:
> +	*mmap_locked = false;
>  	if (folio)
>  		folio_put(folio);
>  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
>  	return result;
>  }
>
> +// Recursive function to consume the bitmap
> +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> +			int referenced, int unmapped, struct collapse_control *cc,
> +			bool *mmap_locked, unsigned long enabled_orders)
> +{
> +	u8 order, offset;
> +	int num_chunks;
> +	int bits_set, max_percent, threshold_bits;
> +	int next_order, mid_offset;
> +	int top = -1;
> +	int collapsed = 0;
> +	int ret;
> +	struct scan_bit_state state;
> +
> +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> +
> +	while (top >= 0) {
> +		state = cc->mthp_bitmap_stack[top--];
> +		order = state.order;
> +		offset = state.offset;
> +		num_chunks = 1 << order;
> +		// Skip mTHP orders that are not enabled
> +		if (!(enabled_orders >> (order + MIN_MTHP_ORDER)) & 1)
> +			goto next;
> +
> +		// copy the relevant section to a new bitmap
> +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> +				   MTHP_BITMAP_SIZE);
> +
> +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> +
> +		// Check if the region is "almost full" based on the threshold
> +		max_percent = ((HPAGE_PMD_NR - khugepaged_max_ptes_none - 1) * 100)
> +				/ (HPAGE_PMD_NR - 1);
> +		threshold_bits = (max_percent * num_chunks) / 100;
> +
> +		if (bits_set >= threshold_bits) {
> +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> +				mmap_locked, order + MIN_MTHP_ORDER, offset * MIN_MTHP_NR);
> +			if (ret == SCAN_SUCCEED)
> +				collapsed += (1 << (order + MIN_MTHP_ORDER));
> +			continue;
> +		}
> +
> +next:
> +		if (order > 0) {
> +			next_order = order - 1;
> +			mid_offset = offset + (num_chunks / 2);
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, mid_offset };
> +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> +				{ next_order, offset };
> +		}
> +	}
> +	return collapsed;
> +}
> +
>  static int khugepaged_scan_pmd(struct mm_struct *mm,
>  			       struct vm_area_struct *vma,
>  			       unsigned long address, bool *mmap_locked,
> @@ -1430,7 +1528,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
>  	pte_unmap_unlock(pte, ptl);
>  	if (result == SCAN_SUCCEED) {
>  		result = collapse_huge_page(mm, address, referenced,
> -					    unmapped, cc);
> +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
>  		/* collapse_huge_page will return with the mmap_lock released */
>  		*mmap_locked = false;
>  	}
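
To convince myself what the "almost full" threshold does, I plugged in a
few values (userspace sketch reusing the formula above; the max_ptes_none
values are examples, only 511 is the khugepaged default set in
khugepaged_init() earlier in this patch's context):

    #include <stdio.h>

    #define HPAGE_PMD_NR 512    /* 4K pages */

    /* Same computation as in khugepaged_scan_bitmap() above. */
    static int threshold_bits(int max_ptes_none, int num_chunks)
    {
            int max_percent = ((HPAGE_PMD_NR - max_ptes_none - 1) * 100)
                              / (HPAGE_PMD_NR - 1);
            return (max_percent * num_chunks) / 100;
    }

    int main(void)
    {
            /* Top of the stack: order 6 == 64 chunks == the whole PMD. */
            printf("%d\n", threshold_bits(0, 64));   /* 64: every chunk must be full  */
            printf("%d\n", threshold_bits(255, 64)); /* 32: half the chunks suffice   */
            printf("%d\n", threshold_bits(511, 64)); /* 0: region always collapses    */
            return 0;
    }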
> @@ -2767,6 +2865,21 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  		return -ENOMEM;
>  	cc->is_khugepaged = false;
>
> +	cc->mthp_bitmap = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!cc->mthp_bitmap)
> +		return -ENOMEM;
> +
> +	cc->mthp_bitmap_temp = kmalloc_array(
> +		BITS_TO_LONGS(MTHP_BITMAP_SIZE), sizeof(unsigned long), GFP_KERNEL);
> +	if (!cc->mthp_bitmap_temp)
> +		return -ENOMEM;
> +
> +	cc->mthp_bitmap_stack = kmalloc_array(
> +		MTHP_BITMAP_SIZE, sizeof(struct scan_bit_state), GFP_KERNEL);
> +	if (!cc->mthp_bitmap_stack)
> +		return -ENOMEM;
> +
>  	mmgrab(mm);
>  	lru_add_drain_all();
>
> @@ -2831,8 +2944,12 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
>  out_nolock:
>  	mmap_assert_locked(mm);
>  	mmdrop(mm);
> +	kfree(cc->mthp_bitmap);
> +	kfree(cc->mthp_bitmap_temp);
> +	kfree(cc->mthp_bitmap_stack);
>  	kfree(cc);
>
> +
>  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
>  			: madvise_collapse_errno(last_fail);
>  }