From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20250417000238.74567-1-npache@redhat.com> <20250417000238.74567-7-npache@redhat.com> <5de38fe3-4a73-443b-87d1-0c836ffdbe30@linux.alibaba.com>
In-Reply-To: <5de38fe3-4a73-443b-87d1-0c836ffdbe30@linux.alibaba.com>
From: Nico Pache
Date: Mon, 28 Apr 2025 08:47:15 -0600
Subject: Re: [PATCH v4 06/12] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
To: Baolin Wang
Cc: linux-mm@kvack.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-trace-kernel@vger.kernel.org, akpm@linux-foundation.org, corbet@lwn.net, rostedt@goodmis.org, mhiramat@kernel.org, mathieu.desnoyers@efficios.com, david@redhat.com, baohua@kernel.org, ryan.roberts@arm.com, willy@infradead.org, peterx@redhat.com, ziy@nvidia.com, wangkefeng.wang@huawei.com, usamaarif642@gmail.com, sunnanyong@huawei.com, vishal.moola@gmail.com, thomas.hellstrom@linux.intel.com, yang@os.amperecomputing.com, kirill.shutemov@linux.intel.com, aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com, anshuman.khandual@arm.com, catalin.marinas@arm.com, tiwai@suse.de, will@kernel.org, dave.hansen@linux.intel.com, jack@suse.cz, cl@gentwo.org, jglisse@google.com, surenb@google.com, zokeefe@google.com, hannes@cmpxchg.org, rientjes@google.com, mhocko@suse.com, rdunlap@infradead.org
Content-Type: text/plain; charset="UTF-8"

On Sat, Apr 26, 2025 at 8:52 PM Baolin Wang wrote:
>
>
>
> On 2025/4/17 08:02, Nico Pache wrote:
> > khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> > mTHP support we use this scan to instead record chunks of utilized
> > sections of the PMD.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the PMD.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > When attempting to collapse an order whose THP setting is "always",
> > always collapse to that order in a greedy manner, without
> > considering the number of bits set.
> >
> > Signed-off-by: Nico Pache
> > ---
> >  include/linux/khugepaged.h |  4 ++
> >  mm/khugepaged.c            | 94 ++++++++++++++++++++++++++++++++++----
> >  2 files changed, 89 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 1f46046080f5..18fe6eb5051d 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> > +#define KHUGEPAGED_MIN_MTHP_ORDER 2
>
> Why is the minimum mTHP order set to 2? IMO, the file large folios can
> support order 1, so we don't expect to collapse exec file small folios
> to order 1 if possible?

I should have been more specific in the patch notes, but this affects
anonymous memory only. I'll go over my commit messages and make sure
this is reflected in the next version.

>
> (PS: I need more time to understand your logic in this patch, and any
> additional explanation would be helpful:) )

We are currently scanning ptes in a PMD. The core principle behind the
bitmap is to keep the single PMD scan while saving its per-chunk state.
We then use this bitmap to determine which chunks of the PMD are active
and are the best candidates for mTHP collapse.

We start at the PMD level, and recursively break down the bitmap to
find the appropriate collapse sizes.

Looking at a simplified example: we scan a PMD and get the following
bitmap, 1111101101101011 (in this case MIN_MTHP_ORDER = 5, so each
bit == 32 ptes; in the actual set MIN_MTHP_ORDER = 2, so each
bit == 4 ptes).

We would first attempt a PMD collapse, while checking the number of
bits set vs the max_ptes_none tunable. If those conditions aren't
triggered, we will try the next enabled mTHP order, for each half of
the bitmap.
i.e.) an order-8 attempt on 11111011 and an order-8 attempt on 01101011.
If a collapse succeeds we don't keep recursing on that portion of the
bitmap. If not, we continue attempting lower orders.

Hopefully that helps you understand my logic here! Let me know if you
need more clarification. I gave a presentation on this that might help too:
https://docs.google.com/presentation/d/1w9NYLuC2kRcMAwhcashU1LWTfmI5TIZRaTWuZq-CHEg/edit?usp=sharing&resourcekey=0-nBAGld8cP1kW26XE6i0Bpg

Cheers,
-- Nico

>
> > +#define KHUGEPAGED_MIN_MTHP_NR (1<<KHUGEPAGED_MIN_MTHP_ORDER)
> > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - KHUGEPAGED_MIN_MTHP_ORDER))
> > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER))
> >
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 5e9272ab82da..83230e9cdf3a 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +	u8 order;
> > +	u16 offset;
> > +};
> > +
> >  struct collapse_control {
> >  	bool is_khugepaged;
> >
> > @@ -102,6 +107,18 @@ struct collapse_control {
> >
> >  	/* nodemask for allocation fallback */
> >  	nodemask_t alloc_nmask;
> > +
> > +	/*
> > +	 * bitmap used to collapse mTHP sizes.
> > +	 * 1 bit = order KHUGEPAGED_MIN_MTHP_ORDER mTHP
> > +	 */
> > +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +	.is_khugepaged = true,
> > +};
> >
> >  /**
> > @@ -851,10 +868,6 @@ static void khugepaged_alloc_sleep(void)
> >  	remove_wait_queue(&khugepaged_wait, &wait);
> >  }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -	.is_khugepaged = true,
> > -};
> > -
> >  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >  {
> >  	int i;
> > @@ -1118,7 +1131,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  			      int referenced, int unmapped,
> > -			      struct collapse_control *cc)
> > +			      struct collapse_control *cc, bool *mmap_locked,
> > +			      u8 order, u16 offset)
> >  {
> >  	LIST_HEAD(compound_pagelist);
> >  	pmd_t *pmd, _pmd;
> > @@ -1137,8 +1151,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  	 * The allocation can take potentially a long time if it involves
> >  	 * sync compaction, and we do not need to hold the mmap_lock during
> >  	 * that. We will recheck the vma after taking it again in write mode.
> >  	 * If collapsing mTHPs we may have already released the read_lock.
> > */ > > - mmap_read_unlock(mm); > > + if (*mmap_locked) { > > + mmap_read_unlock(mm); > > + *mmap_locked =3D false; > > + } > > > > result =3D alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER); > > if (result !=3D SCAN_SUCCEED) > > @@ -1273,12 +1291,72 @@ static int collapse_huge_page(struct mm_struct = *mm, unsigned long address, > > out_up_write: > > mmap_write_unlock(mm); > > out_nolock: > > + *mmap_locked =3D false; > > if (folio) > > folio_put(folio); > > trace_mm_collapse_huge_page(mm, result =3D=3D SCAN_SUCCEED, resul= t); > > return result; > > } > > > > +// Recursive function to consume the bitmap > > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long = address, > > + int referenced, int unmapped, struct collapse_con= trol *cc, > > + bool *mmap_locked, unsigned long enabled_orders) > > +{ > > + u8 order, next_order; > > + u16 offset, mid_offset; > > + int num_chunks; > > + int bits_set, threshold_bits; > > + int top =3D -1; > > + int collapsed =3D 0; > > + int ret; > > + struct scan_bit_state state; > > + bool is_pmd_only =3D (enabled_orders =3D=3D (1 << HPAGE_PMD_ORDER= )); > > + > > + cc->mthp_bitmap_stack[++top] =3D (struct scan_bit_state) > > + { HPAGE_PMD_ORDER - KHUGEPAGED_MIN_MTHP_ORDER, 0 }; > > + > > + while (top >=3D 0) { > > + state =3D cc->mthp_bitmap_stack[top--]; > > + order =3D state.order + KHUGEPAGED_MIN_MTHP_ORDER; > > + offset =3D state.offset; > > + num_chunks =3D 1 << (state.order); > > + // Skip mTHP orders that are not enabled > > + if (!test_bit(order, &enabled_orders)) > > + goto next; > > + > > + // copy the relavant section to a new bitmap > > + bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap,= offset, > > + MTHP_BITMAP_SIZE); > > + > > + bits_set =3D bitmap_weight(cc->mthp_bitmap_temp, num_chun= ks); > > + threshold_bits =3D (HPAGE_PMD_NR - khugepaged_max_ptes_no= ne - 1) > > + >> (HPAGE_PMD_ORDER - state.order); > > + > > + //Check if the region is "almost full" based on the thres= hold > > + 
> > +		if (bits_set > threshold_bits || is_pmd_only
> > +		    || test_bit(order, &huge_anon_orders_always)) {
> > +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +				mmap_locked, order, offset * KHUGEPAGED_MIN_MTHP_NR);
> > +			if (ret == SCAN_SUCCEED) {
> > +				collapsed += (1 << order);
> > +				continue;
> > +			}
> > +		}
> > +
> > +next:
> > +		if (state.order > 0) {
> > +			next_order = state.order - 1;
> > +			mid_offset = offset + (num_chunks / 2);
> > +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +				{ next_order, mid_offset };
> > +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +				{ next_order, offset };
> > +		}
> > +	}
> > +	return collapsed;
> > +}
> > +
> >  static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  			       struct vm_area_struct *vma,
> >  			       unsigned long address, bool *mmap_locked,
> > @@ -1445,9 +1523,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (result == SCAN_SUCCEED) {
> >  		result = collapse_huge_page(mm, address, referenced,
> > -					    unmapped, cc);
> > -		/* collapse_huge_page will return with the mmap_lock released */
> > -		*mmap_locked = false;
> > +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >  	}
> >  out:
> >  	trace_mm_khugepaged_scan_pmd(mm, &folio->page, writable, referenced,
>