From mboxrd@z Thu Jan 1 00:00:00 1970
From: Nico Pache <npache@redhat.com>
Date: Thu, 20 Feb 2025 11:48:47 -0700
Subject: Re: [RFC v2 6/9] khugepaged: introduce khugepaged_scan_bitmap for mTHP support
To: Ryan Roberts
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org, anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com, sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org, rostedt@goodmis.org, mathieu.desnoyers@efficios.com, tiwai@suse.de
In-Reply-To: <45b898f2-c514-44d3-bbd9-523772c71b0b@arm.com>
References: <20250211003028.213461-1-npache@redhat.com> <20250211003028.213461-7-npache@redhat.com> <45b898f2-c514-44d3-bbd9-523772c71b0b@arm.com>
Content-Type: text/plain; charset="UTF-8"
On Wed, Feb 19, 2025 at 9:28 AM Ryan Roberts wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > khugepaged scans PMD ranges for potential collapse to a hugepage. To add
> > mTHP support we use this scan to instead record chunks of fully utilized
> > sections of the PMD.
> >
> > create a bitmap to represent a PMD in order MTHP_MIN_ORDER chunks.
> > by default we will set this to order 3. The reasoning is that for 4K 512
>
> I'm still a bit confused by this (hopefully to be resolved as I'm about to read
> the code); Does this imply that the smallest order you can collapse to is order
> 3? Because that would be different from the fault handler. For anon memory we
> can support order-2 and above. I believe that these days files can support order-1.

Yes, it may have been a premature optimization. I will test with
MTHP_MIN_ORDER=2 and compare!

> There is a case for wanting to support order-2 for arm64. We have a (not yet
> well deployed) technology called Hardware Page Aggregation (HPA) which can
> automatically (transparent to SW) aggregate (usually) 4 contiguous pages into a
> single TLB. I'd like the solution to be compatible with that.

Sounds reasonable, especially if the hardware support is going to give
that size a huge boost.

> > PMD size this results in a 64 bit bitmap which has some optimizations.
> > For other arches like ARM64 64K, we can set a larger order if needed.
> >
> > khugepaged_scan_bitmap uses a stack struct to recursively scan a bitmap
> > that represents chunks of utilized regions. We can then determine what
> > mTHP size fits best and in the following patch, we set this bitmap while
> > scanning the PMD.
> >
> > max_ptes_none is used as a scale to determine how "full" an order must
> > be before being considered for collapse.
> >
> > If a order is set to "always" lets always collapse to that order in a
> > greedy manner.
>
> This is not the policy that the fault handler uses, and I think we should use
> the same policy in both places.
>
> The fault handler gets a list of orders that are permitted for the VMA, then
> prefers the highest orders in that list.
> I don't think we should be preferring a smaller "always" order over a larger
> "madvise" order if MADV_HUGEPAGE is set for the VMA (if that's what your
> statement was suggesting).

It does start at the highest order. All this means is that if you have
PMD=always and 1024kB=madvise, the PMD collapse will always happen (if
we don't want this behavior, lmk!)

>
> >
> > Signed-off-by: Nico Pache <npache@redhat.com>
> > ---
> >  include/linux/khugepaged.h |  4 ++
> >  mm/khugepaged.c            | 89 +++++++++++++++++++++++++++++++++++---
> >  2 files changed, 86 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/linux/khugepaged.h b/include/linux/khugepaged.h
> > index 1f46046080f5..1fe0c4fc9d37 100644
> > --- a/include/linux/khugepaged.h
> > +++ b/include/linux/khugepaged.h
> > @@ -1,6 +1,10 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #ifndef _LINUX_KHUGEPAGED_H
> >  #define _LINUX_KHUGEPAGED_H
> > +#define MIN_MTHP_ORDER 3
> > +#define MIN_MTHP_NR (1 << MIN_MTHP_ORDER)
> > +#define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE * PAGE_SIZE) - MIN_MTHP_ORDER))
>
> I don't think you want "* PAGE_SIZE" here? I think MAX_MTHP_BITMAP_SIZE wants to
> specify the maximum number of groups of MIN_MTHP_NR pte entries in a PTE table?
>
> In that case, MAX_PTRS_PER_PTE will be 512 on x86_64. Your current formula will
> give 262144 (which is 32KB!). I think you just need:

Yes, that is correct! Thanks for pointing that out. The bitmap size is
supposed to be 64, not 262144! That should save some memory :P

> #define MAX_MTHP_BITMAP_SIZE  (1 << (ilog2(MAX_PTRS_PER_PTE) - MIN_MTHP_ORDER))
>
> > +#define MTHP_BITMAP_SIZE  (1 << (HPAGE_PMD_ORDER - MIN_MTHP_ORDER))
>
> Perhaps all of these macros need a KHUGEPAGED_ prefix? Otherwise MIN_MTHP_ORDER,
> especially, is misleading. The min mTHP order is not 3.

I will add the prefixes, thanks!
> >
> >  extern unsigned int khugepaged_max_ptes_none __read_mostly;
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 3776055bd477..c8048d9ec7fb 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -94,6 +94,11 @@ static DEFINE_READ_MOSTLY_HASHTABLE(mm_slots_hash, MM_SLOTS_HASH_BITS);
> >
> >  static struct kmem_cache *mm_slot_cache __ro_after_init;
> >
> > +struct scan_bit_state {
> > +	u8 order;
> > +	u16 offset;
> > +};
> > +
> >  struct collapse_control {
> >  	bool is_khugepaged;
> >
> > @@ -102,6 +107,15 @@ struct collapse_control {
> >
> >  	/* nodemask for allocation fallback */
> >  	nodemask_t alloc_nmask;
> > +
> > +	/* bitmap used to collapse mTHP sizes. 1 bit = order MIN_MTHP_ORDER mTHP */
> > +	DECLARE_BITMAP(mthp_bitmap, MAX_MTHP_BITMAP_SIZE);
> > +	DECLARE_BITMAP(mthp_bitmap_temp, MAX_MTHP_BITMAP_SIZE);
> > +	struct scan_bit_state mthp_bitmap_stack[MAX_MTHP_BITMAP_SIZE];
> > +};
> > +
> > +struct collapse_control khugepaged_collapse_control = {
> > +	.is_khugepaged = true,
> > +};
> >
> >  /**
> > @@ -851,10 +865,6 @@ static void khugepaged_alloc_sleep(void)
> >  	remove_wait_queue(&khugepaged_wait, &wait);
> >  }
> >
> > -struct collapse_control khugepaged_collapse_control = {
> > -	.is_khugepaged = true,
> > -};
> > -
> >  static bool khugepaged_scan_abort(int nid, struct collapse_control *cc)
> >  {
> >  	int i;
> > @@ -1112,7 +1122,8 @@ static int alloc_charge_folio(struct folio **foliop, struct mm_struct *mm,
> >
> >  static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  			      int referenced, int unmapped,
> > -			      struct collapse_control *cc)
> > +			      struct collapse_control *cc, bool *mmap_locked,
> > +			      u8 order, u16 offset)
> >  {
> >  	LIST_HEAD(compound_pagelist);
> >  	pmd_t *pmd, _pmd;
> > @@ -1130,8 +1141,12 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  	 * The allocation can take potentially a long time if it involves
> >  	 * sync compaction, and we do not need to hold the mmap_lock during
> >  	 * that. We will recheck the vma after taking it again in write mode.
> > +	 * If collapsing mTHPs we may have already released the read_lock.
> >  	 */
> > -	mmap_read_unlock(mm);
> > +	if (*mmap_locked) {
> > +		mmap_read_unlock(mm);
> > +		*mmap_locked = false;
> > +	}
> >
> >  	result = alloc_charge_folio(&folio, mm, cc, HPAGE_PMD_ORDER);
> >  	if (result != SCAN_SUCCEED)
> > @@ -1266,12 +1281,71 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address,
> >  out_up_write:
> >  	mmap_write_unlock(mm);
> >  out_nolock:
> > +	*mmap_locked = false;
> >  	if (folio)
> >  		folio_put(folio);
> >  	trace_mm_collapse_huge_page(mm, result == SCAN_SUCCEED, result);
> >  	return result;
> >  }
> >
> > +// Recursive function to consume the bitmap
> > +static int khugepaged_scan_bitmap(struct mm_struct *mm, unsigned long address,
> > +			int referenced, int unmapped, struct collapse_control *cc,
> > +			bool *mmap_locked, unsigned long enabled_orders)
> > +{
> > +	u8 order, next_order;
> > +	u16 offset, mid_offset;
> > +	int num_chunks;
> > +	int bits_set, threshold_bits;
> > +	int top = -1;
> > +	int collapsed = 0;
> > +	int ret;
> > +	struct scan_bit_state state;
> > +
> > +	cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +		{ HPAGE_PMD_ORDER - MIN_MTHP_ORDER, 0 };
> > +
> > +	while (top >= 0) {
> > +		state = cc->mthp_bitmap_stack[top--];
> > +		order = state.order + MIN_MTHP_ORDER;
> > +		offset = state.offset;
> > +		num_chunks = 1 << (state.order);
> > +		// Skip mTHP orders that are not enabled
> > +		if (!test_bit(order, &enabled_orders))
> > +			goto next;
> > +
> > +		// copy the relevant section to a new bitmap
> > +		bitmap_shift_right(cc->mthp_bitmap_temp, cc->mthp_bitmap, offset,
> > +				   MTHP_BITMAP_SIZE);
> > +
> > +		bits_set = bitmap_weight(cc->mthp_bitmap_temp, num_chunks);
> > +		threshold_bits = (HPAGE_PMD_NR - khugepaged_max_ptes_none - 1)
> > +				 >> (HPAGE_PMD_ORDER - state.order);
> > +
> > +		// Check if the region is "almost full" based on the threshold
> > +		if (bits_set > threshold_bits
> > +		    || test_bit(order, &huge_anon_orders_always)) {
> > +			ret = collapse_huge_page(mm, address, referenced, unmapped, cc,
> > +					mmap_locked, order, offset * MIN_MTHP_NR);
> > +			if (ret == SCAN_SUCCEED) {
> > +				collapsed += (1 << order);
> > +				continue;
> > +			}
> > +		}
> > +
> > +next:
> > +		if (state.order > 0) {
> > +			next_order = state.order - 1;
> > +			mid_offset = offset + (num_chunks / 2);
> > +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +				{ next_order, mid_offset };
> > +			cc->mthp_bitmap_stack[++top] = (struct scan_bit_state)
> > +				{ next_order, offset };
> > +		}
> > +	}
> > +	return collapsed;
> > +}
>
> I'm struggling to understand the details of this function. I'll come back to it
> when I have more time.

Hopefully the presentation helped a little. This is effectively a
recursive algorithm, but it uses an explicit stack instead of function
calls. That was done to remove the recursion and to make the result
handling much easier. The basic idea is to start at the PMD order and
work your way down until you find a region whose bitmap satisfies the
conditions for collapse.
If it doesn't collapse, it adds two new collapse attempts to the stack
(order - 1, left and right halves of the region).

>
> > +
> >  static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  			       struct vm_area_struct *vma,
> >  			       unsigned long address, bool *mmap_locked,
> > @@ -1440,7 +1514,7 @@ static int khugepaged_scan_pmd(struct mm_struct *mm,
> >  	pte_unmap_unlock(pte, ptl);
> >  	if (result == SCAN_SUCCEED) {
> >  		result = collapse_huge_page(mm, address, referenced,
> > -					    unmapped, cc);
> > +					    unmapped, cc, mmap_locked, HPAGE_PMD_ORDER, 0);
> >  		/* collapse_huge_page will return with the mmap_lock released */
> >  		*mmap_locked = false;
>
> Given that collapse_huge_page() now takes mmap_locked and sets it to false on
> return, I don't think we need this line here any longer?

I think that is correct! Thanks!

> >  	}
> > @@ -2856,6 +2930,7 @@ int madvise_collapse(struct vm_area_struct *vma, struct vm_area_struct **prev,
> >  	mmdrop(mm);
> >  	kfree(cc);
> >
> > +
> >  	return thps == ((hend - hstart) >> HPAGE_PMD_SHIFT) ? 0
> >  		     : madvise_collapse_errno(last_fail);
> >  }
>