From: Nico Pache
Date: Tue, 18 Feb 2025 15:30:25 -0700
Subject: Re: [RFC v2 0/9] khugepaged: mTHP support
To: Ryan Roberts
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org, linux-mm@kvack.org, anshuman.khandual@arm.com, catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com, surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com, raquini@redhat.com, dev.jain@arm.com, sunnanyong@huawei.com, usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org, rostedt@goodmis.org, mathieu.desnoyers@efficios.com, tiwai@suse.de
In-Reply-To: <8a37f99b-f207-4688-bc90-7f8e6900e29d@arm.com>
References: <20250211003028.213461-1-npache@redhat.com> <8a37f99b-f207-4688-bc90-7f8e6900e29d@arm.com>

On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts wrote:
>
> On 11/02/2025 00:30, Nico Pache wrote:
> > The following series provides khugepaged and madvise collapse with the
> > capability to collapse regions to mTHPs.
> >
> > To achieve this we generalize the khugepaged functions to no longer depend
> > on PMD_ORDER.
> > Then during the PMD scan, we keep track of chunks of pages
> > (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> > using a bitmap. After the PMD scan is done, we do binary recursion on the
> > bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> > on max_ptes_none is removed during the scan, to make sure we account for
> > the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> > order to determine how full a THP must be to be eligible. If an mTHP
> > collapse is attempted but the range contains swapped-out or shared pages,
> > we don't perform the collapse.
> >
> > With the default max_ptes_none=511, the code should keep most of its
> > original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> > With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
>
> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or *equal*)?
> This is making my head hurt, but I *think* I agree with you that if
> max_ptes_none is less than half of the number of ptes in a pmd, then creep
> doesn't happen.

Haha yea, the compressed bitmap does not make the math super easy to
follow, but I'm glad we arrived at the same conclusion :)

>
> To make sure I've understood;
>
> - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
> - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
> - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
> - ...
>
> So if we start with 3 present PTEs in a 16K area, we collapse to 16K and now
> have 4 PTEs in a 32K area which is insufficient to collapse to 32K.
>
> Sounds good to me!

Great! Another easy way to think about it: with max_ptes_none =
HPAGE_PMD_NR/2, a freshly collapsed region fills exactly half of the next
size up, and half is all we need for it to collapse again. Each size is 2x
the last, so if we hit one collapse, it will be eligible again next round.
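The order scaling discussed above can be checked with a bit of arithmetic. This is a minimal sketch that assumes 4K base pages (HPAGE_PMD_ORDER = 9, so 512 PTEs per PMD) and assumes the per-order allowance is max_ptes_none shifted right by the difference between the PMD order and the target order. That assumption reproduces the 3/4, 5/8 and 9/16 thresholds in Ryan's list, but the helper names here are hypothetical and not taken from the series.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: 4K base pages, so a PMD maps 2^9 = 512 PTEs. */
#define HPAGE_PMD_ORDER 9
#define HPAGE_PMD_NR    (1u << HPAGE_PMD_ORDER)

/* Hypothetical helper; assumed scaling: shift the PMD allowance down to @order. */
static unsigned int scaled_max_ptes_none(unsigned int max_ptes_none,
					 unsigned int order)
{
	assert(order <= HPAGE_PMD_ORDER);
	assert(max_ptes_none < HPAGE_PMD_NR);
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

/* Minimum present PTEs for a 2^order region to be eligible for collapse. */
static unsigned int min_present_ptes(unsigned int max_ptes_none,
				     unsigned int order)
{
	return (1u << order) - scaled_max_ptes_none(max_ptes_none, order);
}

/*
 * "Creep" check: a freshly collapsed 2^order region leaves 2^order present
 * PTEs inside the enclosing 2^(order + 1) region. If that alone meets the
 * next order's threshold, the region is promoted again on the next scan.
 * Requires order < HPAGE_PMD_ORDER.
 */
static bool collapse_creeps(unsigned int max_ptes_none, unsigned int order)
{
	return (1u << order) >= min_present_ptes(max_ptes_none, order + 1);
}
```

Under this assumption, max_ptes_none = 255 never creeps (after a collapse to 2^k pages the next order needs 2^k + 1 present PTEs), while max_ptes_none = 256 = HPAGE_PMD_NR/2 creeps at every order, which agrees with the ">=" correction in the nit.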
>
> > constantly promote mTHPs to the next available size.
> >
> > Patch 1: Some refactoring to combine madvise_collapse and khugepaged
> > Patch 2: Refactor/rename hpage_collapse
> > Patch 3-5: Generalize khugepaged functions for arbitrary orders
> > Patch 6-9: The mTHP patches
> >
> > ---------
> > Testing
> > ---------
> > - Built for x86_64, aarch64, ppc64le, and s390x
> > - selftests mm
> > - I created a test script that I used to push khugepaged to its limits while
> >   monitoring a number of stats and tracepoints. The code is available
> >   here[1] (run in legacy mode for these changes and set mthp sizes to inherit).
> >   The summary from my testing was that there was no significant regression
> >   noticed through this test. In some cases my changes had better collapse
> >   latencies and were able to scan more pages in the same amount of time/work,
> >   but for the most part the results were consistent.
> > - redis testing. I tested these changes along with my defer changes
> >   (see followup post for more details).
> > - some basic testing on 64k page size.
> > - lots of general use. These changes have been running in my VM for some time.
> >
> > Changes since V1 [2]:
> > - Minor bug fixes discovered during review and testing
> > - removed dynamic allocations for bitmaps, and made them stack based
> > - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> > - Updated trace events to include collapsing order info.
> > - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> > - No longer require a chunk to be fully utilized before setting the bit. Use
> >   the same max_ptes_none scaling principle to achieve this.
> > - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >   some of the "creep" that was discovered in v1.
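The u8 to u16 change in the list above can be motivated by counting chunks per PMD. This is a minimal sketch under several assumptions that the cover letter does not state: 8-byte PTEs with one page-sized PTE table per PMD (as on x86_64 and arm64), an MTHP_MIN_ORDER of 2 (the 16K-with-4K-pages size at the bottom of Ryan's list), and offsets counted in chunks rather than in PTEs. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: smallest mTHP chunk is order 2 (4 PTEs), one bitmap bit each. */
#define MTHP_MIN_ORDER 2

/*
 * Assumption: 8-byte PTEs and one page-sized PTE table per PMD, so a PMD
 * covers page_size / 8 = 2^(page_shift - 3) PTEs.
 */
static unsigned int ptes_per_pmd(unsigned int page_shift)
{
	assert(page_shift >= 12 && page_shift <= 16);
	return 1u << (page_shift - 3);
}

/* Number of bitmap bits (chunks) needed to cover one PMD. */
static unsigned int chunks_per_pmd(unsigned int page_shift)
{
	return ptes_per_pmd(page_shift) >> MTHP_MIN_ORDER;
}

/* True if every chunk offset (0 .. chunks - 1) fits in a type whose max is @max. */
static bool offsets_fit(unsigned int page_shift, unsigned long max)
{
	return chunks_per_pmd(page_shift) - 1 <= max;
}
```

With 4K pages a PMD holds 512 PTEs, i.e. 128 chunks, so every offset fits in a u8; with 64K pages it holds 8192 PTEs, i.e. 2048 chunks, which needs a u16.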
> >
> > [1] - https://gitlab.com/npache/khugepaged_mthp_test
> > [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >
> > Nico Pache (9):
> >   introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >     madvise_collapse
> >   khugepaged: rename hpage_collapse_* to khugepaged_*
> >   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >   khugepaged: generalize alloc_charge_folio for mTHP support
> >   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >   khugepaged: add mTHP support
> >   khugepaged: improve tracepoints for mTHP orders
> >   khugepaged: skip collapsing mTHP to smaller orders
> >
> >  include/linux/khugepaged.h         |   4 +
> >  include/trace/events/huge_memory.h |  34 ++-
> >  mm/khugepaged.c                    | 422 +++++++++++++++++++---------
> >  3 files changed, 306 insertions(+), 154 deletions(-)
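The cover letter's "binary recursion on the bitmap" step can be pictured as a top-down search: try the whole PMD first, and if the range is not full enough (or that order is not enabled), split it in half and retry each half one order lower, down to MTHP_MIN_ORDER. This is a minimal sketch, not the series' khugepaged_scan_bitmap(). It assumes 4K pages, MTHP_MIN_ORDER = 2, treats every set chunk bit as a fully populated chunk, uses a shift-based max_ptes_none scaling, and skips the swap and shared-page checks. scan_bitmap(), struct collapse_plan and the enabled_orders mask are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 4K base pages and 4-PTE (order-2) chunks per bitmap bit. */
#define HPAGE_PMD_ORDER 9
#define MTHP_MIN_ORDER  2
#define NR_CHUNKS       (1u << (HPAGE_PMD_ORDER - MTHP_MIN_ORDER)) /* 128 */
#define BITMAP_WORDS    (NR_CHUNKS / 64)

/* Hypothetical result list: one entry per planned collapse. */
struct collapse_plan {
	unsigned int nr;
	uint16_t offset[NR_CHUNKS]; /* start of the region, in chunks */
	uint8_t order[NR_CHUNKS];   /* collapse order, in pages */
};

static bool chunk_is_set(const uint64_t *bitmap, unsigned int i)
{
	return (bitmap[i / 64] >> (i % 64)) & 1;
}

/* Estimated present PTEs in chunks [start, start + nr): a set bit counts as full. */
static unsigned int present_ptes(const uint64_t *bitmap, unsigned int start,
				 unsigned int nr)
{
	unsigned int set = 0;

	for (unsigned int i = start; i < start + nr; i++)
		set += chunk_is_set(bitmap, i);
	return set << MTHP_MIN_ORDER;
}

/* Hypothetical top-down search; @offset is in chunks, @order in pages. */
static void scan_bitmap(const uint64_t *bitmap, unsigned int offset,
			unsigned int order, unsigned int enabled_orders,
			unsigned int max_ptes_none, struct collapse_plan *plan)
{
	unsigned int nr_chunks = 1u << (order - MTHP_MIN_ORDER);
	unsigned int present = present_ptes(bitmap, offset, nr_chunks);
	unsigned int needed;

	assert(max_ptes_none < (1u << HPAGE_PMD_ORDER));
	/* Assumed scaling: shift max_ptes_none down to this order. */
	needed = (1u << order) - (max_ptes_none >> (HPAGE_PMD_ORDER - order));

	if (!present)
		return; /* nothing mapped here, no point in splitting further */

	if ((enabled_orders & (1u << order)) && present >= needed) {
		plan->offset[plan->nr] = (uint16_t)offset;
		plan->order[plan->nr] = (uint8_t)order;
		plan->nr++;
		return;
	}

	if (order == MTHP_MIN_ORDER)
		return;

	/* Not eligible at this order: retry each half one order lower. */
	scan_bitmap(bitmap, offset, order - 1, enabled_orders, max_ptes_none, plan);
	scan_bitmap(bitmap, offset + nr_chunks / 2, order - 1, enabled_orders,
		    max_ptes_none, plan);
}
```

For example, with only chunks 0-3 set (16 present PTEs), max_ptes_none = 255 and orders 2 through 9 enabled (mask 0x3FC), the search rejects orders 9 down to 5 and records a single order-4 (64K) collapse at chunk offset 0. Empty halves are pruned immediately, so the work scales with the populated part of the PMD.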