From: Nico Pache <npache@redhat.com>
Date: Mon, 17 Feb 2025 12:19:02 -0700
Subject: Re: [RFC v2 0/9] khugepaged: mTHP support
To: Dev Jain <dev.jain@arm.com>
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-mm@kvack.org, ryan.roberts@arm.com, anshuman.khandual@arm.com,
 catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
 apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
 baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
 haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com,
 aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
 usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org,
 rostedt@goodmis.org, mathieu.desnoyers@efficios.com, tiwai@suse.de
In-Reply-To: <7c2b0248-1153-4ce1-9170-56eeb0511ff1@arm.com>
On Mon, Feb 17, 2025 at 1:06 AM Dev Jain wrote:
>
> On 15/02/25 12:08 pm, Dev Jain wrote:
> >
> > On 15/02/25 6:22 am, Nico Pache wrote:
> >> On Thu, Feb 13, 2025 at 7:02 PM Dev Jain wrote:
> >>>
> >>> On 14/02/25 1:09 am, Nico Pache wrote:
> >>>> On Thu, Feb 13, 2025 at 1:26 AM Dev Jain wrote:
> >>>>>
> >>>>> On 12/02/25 10:19 pm, Nico Pache wrote:
> >>>>>> On Tue, Feb 11, 2025 at 5:50 AM Dev Jain wrote:
> >>>>>>>
> >>>>>>> On 11/02/25 6:00 am, Nico Pache wrote:
> >>>>>>>> The following series provides khugepaged and madvise collapse
> >>>>>>>> with the capability to collapse regions to mTHPs.
> >>>>>>>>
> >>>>>>>> To achieve this we generalize the khugepaged functions so they
> >>>>>>>> no longer depend on PMD_ORDER. Then, during the PMD scan, we keep
> >>>>>>>> track of chunks of pages (defined by MTHP_MIN_ORDER) that are
> >>>>>>>> utilized. This info is tracked using a bitmap. After the PMD scan
> >>>>>>>> is done, we do binary recursion on the bitmap to find the optimal
> >>>>>>>> mTHP sizes for the PMD range. The restriction on max_ptes_none is
> >>>>>>>> removed during the scan, to make sure we account for the whole
> >>>>>>>> PMD range. max_ptes_none will be scaled by the attempted collapse
> >>>>>>>> order to determine how full a THP must be to be eligible.
> >>>>>>>> If an mTHP collapse is attempted but the range contains
> >>>>>>>> swapped-out or shared pages, we don't perform the collapse.
> >>>>>>>>
> >>>>>>>> With the default max_ptes_none=511, the code should keep most of
> >>>>>>>> its original behavior. To exercise mTHP collapse we need to set
> >>>>>>>> max_ptes_none<=255. With max_ptes_none > HPAGE_PMD_NR/2 you will
> >>>>>>>> experience collapse "creep" and constantly promote mTHPs to the
> >>>>>>>> next available size.
> >>>>>>>>
> >>>>>>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged
> >>>>>>>> Patch 2: Refactor/rename hpage_collapse
> >>>>>>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders
> >>>>>>>> Patch 6-9: The mTHP patches
> >>>>>>>>
> >>>>>>>> ---------
> >>>>>>>> Testing
> >>>>>>>> ---------
> >>>>>>>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>>>>>>> - selftests mm
> >>>>>>>> - I created a test script that I used to push khugepaged to its
> >>>>>>>>   limits while monitoring a number of stats and tracepoints. The
> >>>>>>>>   code is available here[1] (run in legacy mode for these changes
> >>>>>>>>   and set mTHP sizes to inherit). The summary from my testing was
> >>>>>>>>   that there was no significant regression noticed through this
> >>>>>>>>   test. In some cases my changes had better collapse latencies
> >>>>>>>>   and were able to scan more pages in the same amount of
> >>>>>>>>   time/work, but for the most part the results were consistent.
> >>>>>>>> - redis testing. I tested these changes along with my defer
> >>>>>>>>   changes (see followup post for more details).
> >>>>>>>> - some basic testing on 64k page size.
> >>>>>>>> - lots of general use. These changes have been running in my VM
> >>>>>>>>   for some time.
> >>>>>>>>
> >>>>>>>> Changes since V1 [2]:
> >>>>>>>> - Minor bug fixes discovered during review and testing
> >>>>>>>> - Removed dynamic allocations for bitmaps, and made them stack based
> >>>>>>>> - Adjusted bitmap offset from u8 to u16 to support 64k page size
> >>>>>>>> - Updated trace events to include collapsing order info
> >>>>>>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale
> >>>>>>>> - No longer require a chunk to be fully utilized before setting
> >>>>>>>>   the bit; use the same max_ptes_none scaling principle to
> >>>>>>>>   achieve this
> >>>>>>>> - Skip mTHP collapses that require swapin or shared-page
> >>>>>>>>   handling; this helps prevent some of the "creep" that was
> >>>>>>>>   discovered in v1
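
(Illustration only: a minimal userspace model of the bitmap scheme
described above, not code from the series. recurse(), used[], and
scaled_max_ptes_none() are made-up names for this sketch; in the patches
the corresponding logic lives in khugepaged_scan_bitmap().)

#include <stdbool.h>
#include <stdio.h>

#define PMD_ORDER  9                 /* 512 PTEs per PMD with 4K pages */
#define MIN_ORDER  2                 /* smallest collapse order tried */
#define NR_CHUNKS  (1 << (PMD_ORDER - MIN_ORDER))

static unsigned int max_ptes_none = 255;  /* tunable; <=255 exercises mTHP */

/* Scale the PMD-order "none" threshold down to a smaller order. */
static unsigned int scaled_max_ptes_none(int order)
{
	return max_ptes_none >> (PMD_ORDER - order);
}

/*
 * Binary recursion over the utilization bitmap: try the largest order
 * first; if the range is too sparse for the scaled threshold, split it
 * in half and retry each half at the next order down.
 */
static void recurse(const bool *used, int start, int nchunks, int order)
{
	int i, utilized = 0;

	if (order < MIN_ORDER)
		return;
	for (i = 0; i < nchunks; i++)
		utilized += used[start + i] ? (1 << MIN_ORDER) : 0;
	if ((1 << order) - utilized <= (int)scaled_max_ptes_none(order)) {
		printf("collapse order %d at chunk %d\n", order, start);
		return;
	}
	recurse(used, start, nchunks / 2, order - 1);
	recurse(used, start + nchunks / 2, nchunks / 2, order - 1);
}

int main(void)
{
	bool used[NR_CHUNKS] = { false };

	for (int i = 0; i < NR_CHUNKS / 2; i++)  /* first half utilized */
		used[i] = true;
	recurse(used, 0, NR_CHUNKS, PMD_ORDER);  /* -> collapse order 8 */
	return 0;
}
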
> >>>>>>>>
> >>>>>>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>>>>>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>>>>>>
> >>>>>>>> Nico Pache (9):
> >>>>>>>>   introduce khugepaged_collapse_single_pmd to unify khugepaged
> >>>>>>>>     and madvise_collapse
> >>>>>>>>   khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>>>>>>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>>>>>>   khugepaged: generalize alloc_charge_folio for mTHP support
> >>>>>>>>   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>>>>>>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>>>>>>   khugepaged: add mTHP support
> >>>>>>>>   khugepaged: improve tracepoints for mTHP orders
> >>>>>>>>   khugepaged: skip collapsing mTHP to smaller orders
> >>>>>>>>
> >>>>>>>>  include/linux/khugepaged.h         |   4 +
> >>>>>>>>  include/trace/events/huge_memory.h |  34 ++-
> >>>>>>>>  mm/khugepaged.c                    | 422 ++++++++++++++----------
> >>>>>>>>  3 files changed, 306 insertions(+), 154 deletions(-)
> >>>>>>>
> >>>>>>> Does this patchset suffer from the problem described here:
> >>>>>>> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> >>>>>>
> >>>>>> Hi Dev,
> >>>>>>
> >>>>>> Sorry, I meant to get back to you about that.
> >>>>>>
> >>>>>> I understand your concern, but like I've mentioned before, the scan
> >>>>>> with the read lock was done so we don't have to do the more
> >>>>>> expensive locking, and could still gain insight into the state. You
> >>>>>> are right that this info could become stale if the state changes
> >>>>>> dramatically, but the collapse_isolate function will verify it and
> >>>>>> not collapse.
> >>>>>
> >>>>> If the state changes dramatically, the _isolate function will verify
> >>>>> it and fall back. And this fallback happens after following this
> >>>>> costly path: retrieve a large folio from the buddy allocator ->
> >>>>> swap in pages from the disk -> mmap_write_lock() ->
> >>>>> anon_vma_lock_write() -> TLB flush on all CPUs -> fallback in
> >>>>> _isolate().
> >>>>> If you do fail in _isolate(), doesn't it make sense to get the
> >>>>> updated state for the next fallback order immediately, because we
> >>>>> have prior information that we failed because of PTE state? What
> >>>>> your algorithm will do is *still* follow the costly path described
> >>>>> above, and again fail in _isolate(), instead of failing in
> >>>>> hpage_collapse_scan_pmd() like mine would.
> >>>>
> >>>> You do raise a valid point here; I can optimize my solution by
> >>>> detecting certain collapse failure types and jumping to the next
> >>>> scan. I'll add that to my solution, thanks!
> >>>>
> >>>> As for the disagreement around the bitmap, we'll leave that up to the
> >>>> community to decide since we have differing opinions/solutions.
> >>>>
> >>>>> The verification of the PTE state by the _isolate() function is the
> >>>>> "no turning back" point of the algorithm. The verification by
> >>>>> hpage_collapse_scan_pmd() is the "let us see if proceeding is even
> >>>>> worth it, before we do costly operations" point of the algorithm.
> >>>>>
> >>>>>> From my testing I found this to rarely happen.
> >>>>>
> >>>>> Unfortunately, I am not very familiar with performance testing/load
> >>>>> testing; I am fairly new to kernel programming, so I am getting there.
> >>>>> But it really depends on the type of test you are running, what
> >>>>> actually runs on memory-intensive systems, etc. In fact, on loaded
> >>>>> systems I would expect the PTE state to change dramatically. But
> >>>>> still, no opinion here.
> >>>>
> >>>> Yeah, there are probably some cases where it happens more often,
> >>>> probably in cases of short-lived allocations, but khugepaged doesn't
> >>>> run that frequently, so those won't be that big of an issue.
> >>>>
> >>>> Our performance team is currently testing my implementation, so I
> >>>> should have more real-workload test results soon. The redis testing
> >>>> had some gains and didn't show any signs of obvious regressions.
> >>>>
> >>>> As for the testing, check out
> >>>> https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
> >>>> -- this does the tracing for my testing script. It can help you get
> >>>> started. There are three different traces being applied there: the
> >>>> bpftrace for collapse latencies, the perf record for the flamegraph
> >>>> (not actually that useful, but it may help visualize any weird/long
> >>>> paths you may not have noticed), and the trace-cmd which records the
> >>>> tracepoints of the scan and collapse functions, then processes the
> >>>> data using the awk script -- the output being the scan rate, the
> >>>> pages collapsed, and their result status (grouped by order).
> >>>>
> >>>> You can also look into https://github.com/gormanm/mmtests for
> >>>> testing/comparing kernels. I was running the
> >>>> config-memdb-redis-benchmark-medium workload.
> >>>
> >>> Thanks. I'll take a look.
> >>>
> >>>>>> Also, khugepaged, my changes, and your changes are all victims of
> >>>>>> this: once we drop the read lock (to either allocate the folio, or
> >>>>>> right before acquiring the write lock), the state can change. In
> >>>>>> your case, yes, you are gathering more up-to-date information, but
> >>>>>> is it really important/worth it to retake locks and rescan for each
> >>>>>> instance if we are about to reverify with the write lock taken?
> >>>>>
> >>>>> You said "reverify": you are removing the verification, so this step
> >>>>> won't be reverification, it will be verification. We do not want to
> >>>>> verify *after* we have already done 95% of the latency-heavy stuff,
> >>>>> only to learn that we are going to fail.
> >>>>>
> >>>>> Algorithms in the kernel, in general, are of the following form:
> >>>>> 1) verify a condition is true, resulting in taking a control path ->
> >>>>> 2) do a lot of stuff -> 3) the "no turning back" step, wherein
> >>>>> before committing (by taking locks, say), reverify this is the
> >>>>> control path we should be in. You are eliminating step 1).
> >>>>>
> >>>>> Therefore, I will have to say that I disagree with your approach.
> >>>>>
> >>>>> On top of this, in the subjective analysis in [1], point number 7
> >>>>> (along with point number 1) remains. And point number 4 remains.
> >>>>
> >>>> For 1), your worst case of 1024 is not the worst case. There are 8
> >>>> possible orders in your implementation; if all are enabled, that is
> >>>> 4096 iterations in the worst case.
> >>>
> >>> Yes, that is exactly what I wrote in 1). I am still not convinced that
> >>> the overhead you produce + 512 iterations is going to beat 4096
> >>> iterations. Anyway, that is hand-waving and we should test this.
> >>>
> >>>> This becomes WAY worse on 64k page size: ~45,000 iterations vs 4096
> >>>> in my case.
> >>>
> >>> Sorry, I am missing something here; how does the number of iterations
> >>> change with page size? Am I not scanning the PTE table, which is
> >>> invariant to the page size?
> >>
> >> I got the calculation wrong the first time and it's actually worse.
> >> Let's hope I got it right this time. On an arm64 64k kernel:
> >> PMD size = 512M
> >> PTE = 64k
> >> PTEs per PMD = 8192
> >
> > *facepalm* my bad, thanks. I got thrown off thinking HPAGE_PMD_NR won't
> > depend on page size, but #pte entries = PAGE_SIZE / sizeof(pte) =
> > PAGE_SIZE / 8. So it does depend. You are correct, the PTEs per PMD is
> > 1 << 13.
> >
> >> log2(8192) = 13, and 13 - 2 = 11 (m)THP sizes including PMD (the first
> >> and second orders are skipped).
> >>
> >> Assuming I understand your algorithm correctly, in the worst case you
> >> are scanning the whole PMD for each order.
> >>
> >> So you scan 8192 PTEs 11 times. 8192 * 11 = 90112.
> >
> > Yup. Now it seems that the bitmap overhead may just be worth it; for
> > the worst case the bitmap will give us an 11x saving... for the average
> > case it will give us 2x, but still, 8192 is a large number. I'll think
> > of ways to test this out.
>
> To clarify on this: the saving is w.r.t. the initial scan. That is, if
> the time taken by NP is x + y + collapse_huge_page(), where x is the PMD
> scan and y is the bitmap overhead, then the time taken by DJ is 2x +
> collapse_huge_page(). In collapse_huge_page(), both perform PTE scans in
> _isolate(). Anyhow, we differ in opinion as to where the max_ptes_*
> check should be placed; I recalled the following:
>
> https://lore.kernel.org/all/20240809103129.365029-2-dev.jain@arm.com/
> https://lore.kernel.org/all/761ba58e-9d6f-4a14-a513-dcc098c2aa94@redhat.com/
>
> One thing you can do to relieve one of my criticisms (not completely) is
> apply the following patch (this can be done in both methods):
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index b589f889bb5a..dc5cb602eaad 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1080,8 +1080,14 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
>  		}
>
>  		vmf.orig_pte = ptep_get_lockless(pte);
> -		if (!is_swap_pte(vmf.orig_pte))
> +		if (!is_swap_pte(vmf.orig_pte)) {
>  			continue;
> +		} else {
> +			if (order != HPAGE_PMD_ORDER) {
> +				result = SCAN_EXCEED_SWAP_PTE;
> +				goto out;
> +			}
> +		}
>
>  		vmf.pte = pte;
>  		vmf.ptl = ptl;
> --
>
> But this really is the same thing being done in the links above :)

Yes, my code already does this:

@@ -1035,6 +1039,11 @@ static int __collapse_huge_page_swapin(struct mm_struct *mm,
 		if (!is_swap_pte(vmf.orig_pte))
 			continue;

+		if (order != HPAGE_PMD_ORDER) {
+			result = SCAN_EXCEED_SWAP_PTE;
+			goto out;
+		}
+

No need for the if/else nesting.

> > Btw, I was made aware that an LWN article just got posted on our work!
> > https://lwn.net/Articles/1009039/
> >
> >> Please let me know if I'm missing something here.
> >>>>>
> >>>>> [1] https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
> >>>>>
> >>>>>> So in my eyes, this is not a "problem"
> >>>>>
> >>>>> Looks like the kernel scheduled us for a high-priority debate, I
> >>>>> hope there's no deadlock :)
> >>>>>
> >>>>>> Cheers,
> >>>>>> -- Nico
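
(A quick sanity check of the iteration arithmetic above: a standalone
snippet, not code from the series; the constants simply mirror the arm64
64k-page configuration being discussed.)

#include <stdio.h>

int main(void)
{
	/* arm64 with 64K base pages: a PMD maps 512M, i.e. 8192 PTEs */
	unsigned long ptes_per_pmd = (512UL << 20) / (64UL << 10);
	/* log2(8192) = 13; skipping orders 0 and 1 leaves 11 sizes */
	unsigned long n_orders = 13 - 2;

	/* worst case when every order rescans the full PMD range */
	printf("%lu\n", ptes_per_pmd * n_orders);   /* prints 90112 */
	return 0;
}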