Subject: Re: [RFC v2 0/9] khugepaged: mTHP support
From: Dev Jain <dev.jain@arm.com>
To: Nico Pache, Ryan Roberts
Cc: linux-kernel@vger.kernel.org, linux-trace-kernel@vger.kernel.org,
    linux-mm@kvack.org, anshuman.khandual@arm.com, catalin.marinas@arm.com,
    cl@gentwo.org, vbabka@suse.cz, mhocko@suse.com, apopple@nvidia.com,
    dave.hansen@linux.intel.com, will@kernel.org, baohua@kernel.org,
    jack@suse.cz, srivatsa@csail.mit.edu, haowenchao22@gmail.com,
    hughd@google.com, aneesh.kumar@kernel.org, yang@os.amperecomputing.com,
    peterx@redhat.com, ioworker0@gmail.com, wangkefeng.wang@huawei.com,
    ziy@nvidia.com, jglisse@google.com, surenb@google.com,
    vishal.moola@gmail.com, zokeefe@google.com, zhengqi.arch@bytedance.com,
    jhubbard@nvidia.com, 21cnbao@gmail.com, willy@infradead.org,
    kirill.shutemov@linux.intel.com, david@redhat.com, aarcange@redhat.com,
    raquini@redhat.com, sunnanyong@huawei.com, usamaarif642@gmail.com,
    audra@redhat.com, akpm@linux-foundation.org, rostedt@goodmis.org,
    mathieu.desnoyers@efficios.com, tiwai@suse.de
Date: Wed, 19 Feb 2025 14:31:09 +0530
Message-ID: <867280bf-2ba1-4e83-8e16-9d93e1c41e08@arm.com>
References: <20250211003028.213461-1-npache@redhat.com> <8a37f99b-f207-4688-bc90-7f8e6900e29d@arm.com>

On 19/02/25 4:00 am, Nico Pache wrote:
> On Tue, Feb 18, 2025 at 9:07 AM Ryan Roberts wrote:
>>
>> On 11/02/2025 00:30, Nico Pache wrote:
>>> The following series provides khugepaged and madvise_collapse with the
>>> capability to collapse regions to mTHPs.
>>>
>>> To achieve this we generalize the khugepaged functions to no longer
>>> depend on PMD_ORDER. Then, during the PMD scan, we keep track of which
>>> chunks of pages (defined by MTHP_MIN_ORDER) are utilized. This info is
>>> tracked using a bitmap. After the PMD scan is done, we do binary
>>> recursion on the bitmap to find the optimal mTHP sizes for the PMD
>>> range. The restriction on max_ptes_none is removed during the scan, to
>>> make sure we account for the whole PMD range. max_ptes_none will be
>>> scaled by the attempted collapse order to determine how full a THP must
>>> be to be eligible. If an mTHP collapse is attempted, but the range
>>> contains swapped-out or shared pages, we don't perform the collapse.
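
If I understand the scaling correctly, the following is a minimal
userspace sketch of the per-order eligibility check. The right-shift
scaling and the helper name below are my assumptions (chosen so that the
numbers match the >=3/4, >=5/8, >=9/16 thresholds Ryan lists further
down), not code from the series:

#include <stdio.h>

#define HPAGE_PMD_ORDER	9		/* 512 PTEs per PMD with 4K pages */

static int max_ptes_none = 255;		/* <= 255 so mTHP collapse can trigger */

/* Hypothetical: how many PTEs may be none at this collapse order */
static int scaled_max_ptes_none(int order)
{
	return max_ptes_none >> (HPAGE_PMD_ORDER - order);
}

int main(void)
{
	int order;

	/* order 2 = 16K, order 3 = 32K, order 4 = 64K with 4K pages */
	for (order = 2; order <= 4; order++) {
		int nr_ptes = 1 << order;

		printf("order %d: need >= %d of %d PTEs present\n",
		       order, nr_ptes - scaled_max_ptes_none(order), nr_ptes);
	}
	return 0;
}

With max_ptes_none = 255 this prints 3 of 4, 5 of 8, and 9 of 16,
matching the thresholds quoted below.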
>>>
>>> With the default max_ptes_none=511, the code should keep most of its
>>> original behavior. To exercise mTHP collapse we need to set
>>> max_ptes_none <= 255. With max_ptes_none > HPAGE_PMD_NR/2 you will
>>> experience collapse "creep" and
>>
>> nit: I think you mean "max_ptes_none >= HPAGE_PMD_NR/2" (greater or
>> *equal*)? This is making my head hurt, but I *think* I agree with you
>> that if max_ptes_none is less than half of the number of ptes in a pmd,
>> then creep doesn't happen.
> Haha, yeah, the compressed bitmap does not make the math super easy to
> follow, but I'm glad we arrived at the same conclusion :)
>>
>> To make sure I've understood:
>>
>> - to collapse to 16K, you would need >=3 out of 4 PTEs to be present
>> - to collapse to 32K, you would need >=5 out of 8 PTEs to be present
>> - to collapse to 64K, you would need >=9 out of 16 PTEs to be present
>> - ...
>>
>> So if we start with 3 present PTEs in a 16K area, we collapse to 16K
>> and now have 4 PTEs in a 32K area, which is insufficient to collapse
>> to 32K.
>>
>> Sounds good to me!
> Great! Another easy way to think about it is: with max_ptes_none =
> HPAGE_PMD_NR/2, a collapse will double the size, and we only need half
> for it to collapse again. Each size is 2x the last, so if we hit one
> collapse, it will be eligible again next round.

Please someone correct me if I am wrong. Consider this: you are
collapsing a 256K folio.
=> #PTEs = 256K/4K = 64 => #chunks = 64/8 = 8

Let the PTE state within the chunks be as follows:

Chunk 0: < 5 filled
Chunk 1: 5 filled
Chunk 2: 5 filled
Chunk 3: 5 filled
Chunk 4: 5 filled
Chunk 5: < 5 filled
Chunk 6: < 5 filled
Chunk 7: < 5 filled

Consider max_ptes_none = 40% (512 * 40 / 100 = 204.8, round down to 204,
which is < HPAGE_PMD_NR/2).
=> To collapse, we need at least 60% of the PTEs filled.

Your algorithm marks a chunk in the bitmap if at least 60% of the chunk
is filled. Then, if the fraction of chunks set is greater than 60%, we
will collapse.

Chunk 0 will be marked zero because fewer than 5 of its PTEs are filled
=> percentage filled <= 50%.

Right now the state is 0111 1000, where the indices are the chunk
numbers. Since #1s = 4 => percent set = 4/8 * 100 = 50%, the 256K folio
collapse won't happen.

For the first 4 chunks, the percent set is 3/4 = 75%, so the 128K
collapse happens. The state then becomes 1111 1000, and now the 256K
collapse will happen.

Either I got this correct, or I do not understand the utility of
maintaining chunks :) What you are doing is what I am doing, except that
my chunk size = 1. (I have put a small simulation of this example at the
end of this mail.)

>>
>>> constantly promote mTHPs to the next available size.
>>>
>>> Patch 1: Some refactoring to combine madvise_collapse and khugepaged
>>> Patch 2: Refactor/rename hpage_collapse
>>> Patch 3-5: Generalize khugepaged functions for arbitrary orders
>>> Patch 6-9: The mTHP patches
>>>
>>> ---------
>>> Testing
>>> ---------
>>> - Built for x86_64, aarch64, ppc64le, and s390x
>>> - selftests mm
>>> - I created a test script that I used to push khugepaged to its limits
>>>   while monitoring a number of stats and tracepoints. The code is
>>>   available here [1] (run in legacy mode for these changes and set mTHP
>>>   sizes to inherit). The summary from my testing was that there was no
>>>   significant regression noticed through this test. In some cases my
>>>   changes had better collapse latencies and were able to scan more
>>>   pages in the same amount of time/work, but for the most part the
>>>   results were consistent.
>>> - redis testing. I tested these changes along with my defer changes
>>>   (see the follow-up post for more details).
>>> - some basic testing on 64K page size.
>>> - lots of general use. These changes have been running in my VM for
>>>   some time.
>>>
>>> Changes since V1 [2]:
>>> - Minor bug fixes discovered during review and testing
>>> - Removed dynamic allocations for bitmaps, and made them stack-based
>>> - Adjusted bitmap offset from u8 to u16 to support 64K page size
>>> - Updated trace events to include collapsing order info
>>> - Scaled max_ptes_none by order rather than scaling to a 0-100 scale
>>> - No longer require a chunk to be fully utilized before setting the
>>>   bit; use the same max_ptes_none scaling principle to achieve this
>>> - Skip mTHP collapse that requires swapin or shared handling. This
>>>   helps prevent some of the "creep" that was discovered in v1.
>>>
>>> [1] - https://gitlab.com/npache/khugepaged_mthp_test
>>> [2] - https://lore.kernel.org/lkml/20250108233128.14484-1-npache@redhat.com/
>>>
>>> Nico Pache (9):
>>>   introduce khugepaged_collapse_single_pmd to unify khugepaged and
>>>     madvise_collapse
>>>   khugepaged: rename hpage_collapse_* to khugepaged_*
>>>   khugepaged: generalize hugepage_vma_revalidate for mTHP support
>>>   khugepaged: generalize alloc_charge_folio for mTHP support
>>>   khugepaged: generalize __collapse_huge_page_* for mTHP support
>>>   khugepaged: introduce khugepaged_scan_bitmap for mTHP support
>>>   khugepaged: add mTHP support
>>>   khugepaged: improve tracepoints for mTHP orders
>>>   khugepaged: skip collapsing mTHP to smaller orders
>>>
>>>  include/linux/khugepaged.h         |   4 +
>>>  include/trace/events/huge_memory.h |  34 ++-
>>>  mm/khugepaged.c                    | 422 +++++++++++++++++++----------
>>>  3 files changed, 306 insertions(+), 154 deletions(-)
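
For completeness, here is the small simulation I mentioned above. It is
a hypothetical userspace model of my 256K example, not code from the
series: 8 chunks of 8 PTEs, a chunk bit is set when at least 60% of its
PTEs are present, and a range collapses when at least 60% of its chunk
bits are set (modelling max_ptes_none = 204):

#include <stdio.h>

#define NR_CHUNKS	8	/* 256K / 4K = 64 PTEs, 8 PTEs per chunk */

/* Does 'nr' out of 'total' meet the 60% threshold? */
static int meets_threshold(int nr, int total)
{
	return nr * 100 >= total * 60;
}

int main(void)
{
	/* PTEs present per chunk, as in the example above */
	int filled[NR_CHUNKS] = { 4, 5, 5, 5, 5, 4, 4, 4 };
	int bitmap[NR_CHUNKS], i, set;

	for (i = 0; i < NR_CHUNKS; i++)		/* 5/8 = 62.5% sets the bit */
		bitmap[i] = meets_threshold(filled[i], 8);

	for (set = 0, i = 0; i < NR_CHUNKS; i++)
		set += bitmap[i];
	printf("256K, pass 1: %d/8 chunks set -> %s\n", set,
	       meets_threshold(set, NR_CHUNKS) ? "collapse" : "no collapse");

	for (set = 0, i = 0; i < 4; i++)
		set += bitmap[i];
	printf("128K, first half: %d/4 chunks set -> %s\n", set,
	       meets_threshold(set, 4) ? "collapse" : "no collapse");

	/* The 128K collapse fills every PTE in chunks 0-3 */
	for (i = 0; i < 4; i++)
		bitmap[i] = 1;
	for (set = 0, i = 0; i < NR_CHUNKS; i++)
		set += bitmap[i];
	printf("256K, pass 2: %d/8 chunks set -> %s\n", set,
	       meets_threshold(set, NR_CHUNKS) ? "collapse" : "no collapse");
	return 0;
}

This reproduces the walk-through above: 4/8 chunks -> no 256K collapse,
3/4 -> 128K collapse, and then 5/8 -> 256K collapse, i.e. the 256K
collapse only becomes possible because the 128K collapse filled chunks
0-3 first.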