From: Nico Pache <npache@redhat.com>
Date: Thu, 13 Feb 2025 12:39:30 -0700
Subject: Re: [RFC v2 0/9] khugepaged: mTHP support
To: Dev Jain
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
 linux-mm@kvack.org, ryan.roberts@arm.com, anshuman.khandual@arm.com,
 catalin.marinas@arm.com, cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com,
 apopple@nvidia.com, dave.hansen@linux.intel.com, will@kernel.org,
 baohua@kernel.org, jack@suse.cz, srivatsa@csail.mit.edu,
 haowenchao22@gmail.com, hughd@google.com, aneesh.kumar@kernel.org,
 yang@os.amperecomputing.com, peterx@redhat.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, ziy@nvidia.com, jglisse@google.com,
 surenb@google.com, vishal.moola@gmail.com, zokeefe@google.com,
 zhengqi.arch@bytedance.com, jhubbard@nvidia.com, 21cnbao@gmail.com,
 willy@infradead.org, kirill.shutemov@linux.intel.com, david@redhat.com,
 aarcange@redhat.com, raquini@redhat.com, sunnanyong@huawei.com,
 usamaarif642@gmail.com, audra@redhat.com, akpm@linux-foundation.org,
 rostedt@goodmis.org, mathieu.desnoyers@efficios.com, tiwai@suse.de
In-Reply-To: <4ba52062-1bd3-4d53-aa28-fcbbd4913801@arm.com>
References: <20250211003028.213461-1-npache@redhat.com>
 <5a995dc9-fee7-442f-b439-c484d9de1750@arm.com>
 <4ba52062-1bd3-4d53-aa28-fcbbd4913801@arm.com>

On Thu, Feb 13, 2025 at 1:26 AM Dev Jain wrote:
>
> On 12/02/25 10:19 pm, Nico Pache wrote:
> > On Tue, Feb 11, 2025 at 5:50 AM Dev Jain wrote:
> >>
> >> On 11/02/25 6:00 am, Nico Pache wrote:
> >>> The following series provides khugepaged and madvise collapse with the
> >>> capability to collapse regions to mTHPs.
> >>>
> >>> To achieve this we generalize the khugepaged functions to no longer depend
> >>> on PMD_ORDER. Then during the PMD scan, we keep track of chunks of pages
> >>> (defined by MTHP_MIN_ORDER) that are utilized. This info is tracked
> >>> using a bitmap. After the PMD scan is done, we do binary recursion on the
> >>> bitmap to find the optimal mTHP sizes for the PMD range. The restriction
> >>> on max_ptes_none is removed during the scan, to make sure we account for
> >>> the whole PMD range. max_ptes_none will be scaled by the attempted collapse
> >>> order to determine how full a THP must be to be eligible. If an mTHP collapse
> >>> is attempted, but the range contains swapped-out or shared pages, we don't
> >>> perform the collapse.
> >>>
> >>> With the default max_ptes_none=511, the code should keep most of its
> >>> original behavior. To exercise mTHP collapse we need to set max_ptes_none<=255.
> >>> With max_ptes_none > HPAGE_PMD_NR/2 you will experience collapse "creep" and
> >>> constantly promote mTHPs to the next available size.
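
To make that scaling concrete, a minimal sketch (not the actual patch
code; the helper name is invented here, while khugepaged_max_ptes_none
and HPAGE_PMD_ORDER are existing kernel symbols):

/*
 * Minimal sketch: scale the global max_ptes_none tunable down to a
 * given collapse order.  A PMD collapse (order 9 with 4K pages)
 * covers HPAGE_PMD_NR == 512 PTEs, while an order-4 collapse covers
 * only 16, so the tolerated number of empty PTEs shrinks
 * proportionally with the order.
 */
static unsigned int scaled_max_ptes_none(unsigned int order)
{
	/* khugepaged_max_ptes_none is the existing sysfs tunable (0..511) */
	return khugepaged_max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

With the default of 511, an order-4 attempt would be allowed
511 >> 5 = 15 empty PTEs out of 16, which is why the default keeps the
behavior close to the old PMD-only collapse.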
> >>>
> >>> Patch 1:     Some refactoring to combine madvise_collapse and khugepaged
> >>> Patch 2:     Refactor/rename hpage_collapse
> >>> Patch 3-5:   Generalize khugepaged functions for arbitrary orders
> >>> Patch 6-9:   The mTHP patches
> >>>
> >>> ---------
> >>> Testing
> >>> ---------
> >>> - Built for x86_64, aarch64, ppc64le, and s390x
> >>> - selftests mm
> >>> - I created a test script that I used to push khugepaged to its limits while
> >>>   monitoring a number of stats and tracepoints. The code is available
> >>>   here[1] (Run in legacy mode for these changes and set mthp sizes to inherit)
> >>>   The summary from my testing was that there was no significant regression
> >>>   noticed through this test. In some cases my changes had better collapse
> >>>   latencies, and were able to scan more pages in the same amount of time/work,
> >>>   but for the most part the results were consistent.
> >>> - redis testing. I tested these changes along with my defer changes
> >>>   (see followup post for more details).
> >>> - some basic testing on 64k page size.
> >>> - lots of general use. These changes have been running in my VM for some time.
> >>>
> >>> Changes since V1 [2]:
> >>> - Minor bug fixes discovered during review and testing
> >>> - removed dynamic allocations for bitmaps, and made them stack based
> >>> - Adjusted bitmap offset from u8 to u16 to support 64k pagesize.
> >>> - Updated trace events to include collapsing order info.
> >>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale.
> >>> - No longer require a chunk to be fully utilized before setting the bit. Use
> >>>   the same max_ptes_none scaling principle to achieve this.
> >>> - Skip mTHP collapse that requires swapin or shared handling. This helps prevent
> >>>   some of the "creep" that was discovered in v1.
> >>>
> >>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
> >>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
> >>>
> >>> Nico Pache (9):
> >>>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
> >>>     madvise_collapse
> >>>   khugepaged: rename hpage_collapse_* to khugepaged_*
> >>>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
> >>>   khugepaged: generalize alloc_charge_folio for mTHP support
> >>>   khugepaged: generalize __collapse_huge_page_* for mTHP support
> >>>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
> >>>   khugepaged: add mTHP support
> >>>   khugepaged: improve tracepoints for mTHP orders
> >>>   khugepaged: skip collapsing mTHP to smaller orders
> >>>
> >>>  include/linux/khugepaged.h         |   4 +
> >>>  include/trace/events/huge_memory.h |  34 ++-
> >>>  mm/khugepaged.c                    | 422 +++++++++++++++++++++++----------
> >>>  3 files changed, 306 insertions(+), 154 deletions(-)
> >>>
> >>
> >> Does this patchset suffer from the problem described here:
> >> https://lore.kernel.org/all/8abd99d5-329f-4f8d-8680-c2d48d4963b6@arm.com/
> > Hi Dev,
> >
> > Sorry, I meant to get back to you about that.
> >
> > I understand your concern, but like I've mentioned before, the scan
> > with the read lock was done so we don't have to do the more expensive
> > locking, and could still gain insight into the state. You are right
> > that this info could become stale if the state changes dramatically,
> > but the collapse_isolate function will verify it and not collapse.
>
> If the state changes dramatically, the _isolate function will verify it,
> and fall back.
> And this fallback happens after following this costly
> path: retrieve a large folio from the buddy allocator -> swapin pages
> from the disk -> mmap_write_lock() -> anon_vma_lock_write() -> TLB flush
> on all CPUs -> fallback in _isolate().
> If you do fail in _isolate(), doesn't it make sense to get the updated
> state for the next fallback order immediately, because we have prior
> information that we failed because of PTE state? What your algorithm
> will do is *still* follow the costly path described above, and again
> fail in _isolate(), instead of failing in hpage_collapse_scan_pmd() like
> mine would.

You do raise a valid point here: I can optimize my solution by
detecting certain collapse failure types and jumping to the next scan
(sketched below). I'll add that to my solution, thanks!

As for the disagreement around the bitmap, we'll leave that up to the
community to decide since we have differing opinions/solutions.

> The verification of the PTE state by the _isolate() function is the "no
> turning back" point of the algorithm. The verification by
> hpage_collapse_scan_pmd() is the "let us see if proceeding is even worth
> it, before we do costly operations" point of the algorithm.
>
> > From my testing I found this to rarely happen.
>
> Unfortunately, I am not very familiar with performance testing/load
> testing; I am fairly new to kernel programming, so I am getting there.
> But it really depends on the type of test you are running, what actually
> runs on memory-intensive systems, etc. In fact, on loaded systems I
> would expect the PTE state to dramatically change. But still, no opinion
> here.

Yeah, there are probably some cases where it happens more often,
probably with short-lived allocations, but khugepaged doesn't run that
frequently, so those won't be that big of an issue. Our performance
team is currently testing my implementation, so I should have more
real-workload test results soon. The redis testing had some gains and
didn't show any signs of obvious regressions.

As for the testing, check out
https://gitlab.com/npache/khugepaged_mthp_test/-/blob/master/record-khuge-performance.sh?ref_type=heads
-- this does the tracing for my testing script and can help you get
started. There are 3 different traces being applied there: the bpftrace
for collapse latencies, the perf record for the flamegraph (not
actually that useful, but it may help visualize any weird/long paths
you may not have noticed), and the trace-cmd which records the
tracepoints of the scan and collapse functions, then processes the
data using the awk script -- the output being the scan rate, the pages
collapsed, and their result status (grouped by order).

You can also look into https://github.com/gormanm/mmtests for
testing/comparing kernels. I was running the
config-memdb-redis-benchmark-medium workload.
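
A rough sketch of that failure-type dispatch (the helper name and the
exact classification are illustrative guesses; the SCAN_* values are
the existing result codes in mm/khugepaged.c):

/*
 * Illustrative sketch only: after a failed collapse attempt, decide
 * from the result code whether the recorded PTE state is still
 * trustworthy enough to immediately try the next (smaller) order, or
 * whether the range should be rescanned first.
 */
static bool collapse_needs_rescan(int result)
{
	switch (result) {
	case SCAN_PAGE_LOCK:
	case SCAN_PAGE_COUNT:
	case SCAN_PAGE_LRU:
		/* transient contention; a lower order may still succeed */
		return false;
	case SCAN_EXCEED_NONE_PTE:
	case SCAN_EXCEED_SWAP_PTE:
	case SCAN_EXCEED_SHARED_PTE:
	case SCAN_PTE_NON_PRESENT:
		/* the PTE state diverged from what the scan recorded */
		return true;
	default:
		return true;
	}
}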
>
> Algorithms in the kernel, in general, are of the following form: 1)
> verify if a condition is true, resulting in taking a control path -> 2)
> do a lot of stuff -> 3) the "no turning back" step, wherein before
> committing (by taking locks, say), reverify that this is the control
> path we should be in. You are eliminating step 1).
>
> Therefore, I will have to say that I disagree with your approach.
>
> On top of this, in the subjective analysis in [1], point number 7 (along
> with point number 1) remains. And point number 4 remains.

For 1), your worst case of 1024 is not the worst case. There are 8
possible orders in your implementation; if all are enabled, that is
4096 iterations in the worst case (the 512-entry PMD range rescanned
once per enabled order: 8 x 512 = 4096). This becomes WAY worse on 64k
page size: ~45,000 iterations vs 4096 in my case.

> [1]
> https://lore.kernel.org/all/23023f48-95c6-4a24-ac8b-aba4b1a441b4@arm.com/
>
> >
> > So in my eyes, this is not a "problem"
>
> Looks like the kernel scheduled us for a high-priority debate, I hope
> there's no deadlock :)
>
> >
> > Cheers,
> > -- Nico