From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <58f91a56-890a-45d0-8b1f-47c4c70c9600@arm.com>
Date: Tue, 17 Sep 2024 11:19:21 +0100
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
From: Ryan Roberts
To: Barry Song
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com,
 anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com, gshan@redhat.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com>
 <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On 17/09/2024 10:09, Barry Song wrote:
> On Tue, Sep 17, 2024 at 4:54 PM Ryan Roberts wrote:
>>
>> On 17/09/2024 09:44, Barry Song wrote:
>>> On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
>>>>
>>>> On 17/09/2024 04:55, Dev Jain wrote:
>>>>>
>>>>> On 9/16/24 18:54, Matthew Wilcox wrote:
>>>>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
>>>>>>> We use pte_range_none() to determine whether contiguous PTEs are empty
>>>>>>> for an mTHP allocation.
>>>>>>> Instead of iterating the while loop for every
>>>>>>> order, use some information, namely the first set PTE found, from the
>>>>>>> previous iteration to eliminate some cases. The key to understanding
>>>>>>> the correctness of the patch is that the ranges we want to examine
>>>>>>> form a strictly decreasing sequence of nested intervals.
>>>>>>
>>>>>> This is a lot more complicated. Do you have any numbers that indicate
>>>>>> that it's faster? Yes, it's fewer memory references, but you've gone
>>>>>> from a simple linear scan that's easy to prefetch to an exponential
>>>>>> scan that might confuse the prefetchers.
>>>>>
>>>>> I do have some numbers. I tested with a simple program and also used
>>>>> the ktime API; with the latter, enclosing everything from
>>>>> "order = highest_order(orders)" to "pte_unmap(pte)" (the entire while
>>>>> loop), a rough estimate is that without the patch it takes 1700 ns to
>>>>> execute, and with the patch it takes 80 - 100 ns less on average. I
>>>>> cannot think of a good testing program...
>>>>>
>>>>> For the prefetching thingy, I am still doing a linear scan, and in each
>>>>> iteration, with the patch, the range I am scanning is going to strictly
>>>>> lie inside the range I would have scanned without the patch. Won't the
>>>>> compiler and the CPU still do prefetching, just on a smaller range?
>>>>> Where does the prefetcher get confused? I confess, I do not understand
>>>>> this very well.
>>>>
>>>> A little history on this: my original "RFC v2" for mTHP included this
>>>> optimization [1], but Yu Zhou suggested dropping it to keep things
>>>> simple, which I did. Then at v8, DavidH suggested we could benefit from
>>>> this sort of optimization, but we agreed to do it later as a separate
>>>> change [2]:
>>>>
>>>> """
>>>>>> Comment: Likely it would make sense to scan only once and determine
>>>>>> the "largest none range" around that address, having the largest
>>>>>> suitable order in mind.
>>>>>
>>>>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to
>>>>> this, IIRC. Perhaps this is an optimization opportunity for later?
>>>>
>>>> Yes, definitely.
>>>> """
>>>>
>>>> Dev independently discovered this opportunity while reading the code,
>>>> and I pointed him to the history and suggested it would likely be
>>>> worthwhile to send a patch.
>>>>
>>>> My view is that I don't see how this can harm performance; in the
>>>> common case, when a single order is enabled, this is essentially the
>>>> same as before. But when there are multiple orders enabled, we are now
>>>> just doing a single linear scan of the ptes rather than multiple scans.
>>>> There will likely be some stack accesses interleaved, but I'd be
>>>> gobsmacked if the prefetchers can't tell the difference between the
>>>> stack and other areas of memory.
>>>>
>>>> Perhaps some perf numbers would help; I think the simplest way to
>>>> gather some numbers would be to create a microbenchmark to allocate a
>>>> large VMA, then fault in single pages at a given stride (say, 1 every
>>>> 128K), then enable 1M, 512K, 256K, 128K and 64K mTHP, then memset the
>>>> entire VMA. It's a bit contrived, but this patch will show improvement
>>>> if the scan is currently a significant portion of the page fault.
>>>>
>>>> If the proposed benchmark shows an improvement, and we don't see any
>>>> regression when only enabling 64K, then my vote would be to accept the
>>>> patch.
>>>
>>> Agreed. The challenge now is how to benchmark this. In a system without
>>> fragmentation, we consistently succeed in allocating the largest size
>>> (1MB). Therefore, we need an environment where allocations of various
>>> sizes can fail proportionally, allowing pte_range_none() to fail on
>>> larger sizes but succeed on smaller ones.
>>
>> I don't think this is about allocation failure; it's about finding a
>> folio order that fits into the VMA without overlapping any already
>> non-none PTEs.
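The difference between the two scans can be sketched with a small userspace model (Python below; a boolean list stands in for the PTEs, the align-down-and-scan loop approximates what alloc_anon_folio() does, and the function names and check counters are illustrative, not kernel API):

```python
# Toy model of the pte_range_none() loop. ptes[i] == True means "PTE i is
# already present"; orders are powers of two in 4K pages (4 = 64K, 8 = 1M).

def old_scan(ptes, fault, orders):
    """Current code: a fresh pte_range_none() scan for every enabled order."""
    checks = 0
    for o in sorted(orders, reverse=True):
        start = fault & ~((1 << o) - 1)      # align fault down to this order
        end = start + (1 << o)
        i = start
        while i < end and not ptes[i]:
            checks += 1
            i += 1
        if i >= end:
            return o, checks                 # whole range none: use this order
        checks += 1                          # the check that hit a present PTE
    return -1, checks

def new_scan(ptes, fault, orders):
    """Patched idea: the ranges are nested, so reuse what the previous
    (larger) scan already learned instead of rescanning."""
    checks = 0
    none_upto = 0        # PTEs below this index in the outer range are none
    present_at = -1      # index of the present PTE that ended the last scan
    for o in sorted(orders, reverse=True):
        start = fault & ~((1 << o) - 1)
        end = start + (1 << o)
        if start <= present_at < end:
            continue                         # known present PTE: skip order
        i = max(start, none_upto)            # never re-check a known-none PTE
        while i < end and not ptes[i]:
            checks += 1
            i += 1
        if i >= end:
            return o, checks
        checks += 1
        present_at, none_upto = i, i

    return -1, checks

# One present PTE every 32 pages (at offset 31), fault at page 0, with
# orders {1M, 512K, 256K, 128K, 64K} enabled:
ptes = [i % 32 == 31 for i in range(256)]
print(old_scan(list(ptes), 0, [8, 7, 6, 5, 4]))   # (4, 144)
print(new_scan(list(ptes), 0, [8, 7, 6, 5, 4]))   # (4, 32)
```

With that pre-fault pattern both versions settle on the 64K order, but the old flow touches 4 * 32 + 16 = 144 entries while the new one touches 32, matching the 144-vs-32 arithmetic discussed in the thread.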
>>
>>> It seems we can't micro-benchmark this with a small program.
>>
>> My proposal was to deliberately fault in a single (4K) page every 128K.
>> That will cause the scanning logic to reduce the order to the next lowest
>> enabled order and try again. So with the current code, for each of the
>> orders {1M, 512K, 256K, 128K} you would scan the first 128K of ptes (32
>> entries), then for 64K you would scan 16 entries: 4 * 32 + 16 = 144
>> entries. With the new change, you would just scan 32 entries.
>
> I'm a bit confused. If we have a VMA from 1GB to 1GB+4MB, and even if you
> fault in a single 4KB page every 128KB with 1MB enabled, you'd still
> succeed in allocating 1MB. For the range 1GB+128KB to 1GB+1MB, wouldn't
> there be no page faults? So, you'd still end up scanning 256 entries
> (1MB/4KB)?

Sorry, I'm not following this.

- start with all mTHP orders disabled.
- mmap a 1G region, which is 1G aligned.
- write a single byte every 128K throughout the VMA.
  - causes 1 4K page to be mapped every 32 pages;
  - 1x4K-present, 31x4K-none, 1x4K-present, 31x4K-none, ...
- enable mTHP orders {1M, 512K, 256K, 128K, 64K, 32K, 16K}
- madvise(MADV_HUGEPAGE)
- write a single byte every 4K throughout the VMA.
  - causes the biggest possible mTHP orders to be allocated in the 31x4K holes
  - 4x4K, 1x16K, 1x32K, 1x64K, 4x4K, 1x16K, 1x32K, 1x64K, ...

Perhaps I didn't make it clear that mTHP would be disabled during the 4K
"pre-faulting" phase, then enabled for the "post-faulting" phase?

> For the entire 4MB VMA, we only have 4 page faults? For each page, we scan
> 256 entries, and there's no way to scan the next possible larger order,
> like 512KB if 1MB has succeeded?
>
> My point is that we need to make the 1MB allocation fail in order to
> disrupt the continuity of pte_none(); otherwise, pte_range_none() will
> return true for the largest order.

But we can simulate that by putting single 4K entries strategically in the
pgtable.
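That expected per-128K layout can be sanity-checked with a toy model of a single window (Python; a boolean list stands in for one 32-page window of PTEs, and the greedy "largest aligned all-none order" rule plus the `fill_window` helper are simplifications for illustration, not kernel code):

```python
# Model one 128K window after pre-faulting: PTE 0 is present, PTEs 1..31 are
# none. Orders are in 4K pages (2 = 16K, 3 = 32K, 4 = 64K). Fault every 4K
# and take the largest enabled order whose naturally aligned range is still
# entirely none, falling back to a single 4K page (order 0).

def fill_window(ptes, orders):
    allocs = []                        # orders actually allocated, in order
    n = len(ptes)
    for i in range(n):                 # simulate writing one byte per page
        if ptes[i]:
            continue                   # already mapped: no fault
        for o in sorted(orders, reverse=True) + [0]:
            start = i & ~((1 << o) - 1)
            end = start + (1 << o)
            if end <= n and not any(ptes[start:end]):
                for j in range(start, end):
                    ptes[j] = True     # map the whole folio
                allocs.append(o)
                break
    return allocs

window = [i == 0 for i in range(32)]   # PTE 0 pre-faulted, 31-page hole
print(fill_window(window, [8, 7, 6, 5, 4, 3, 2]))   # [0, 0, 0, 2, 3, 4]
```

The hole gets filled with 3x4K, 1x16K, 1x32K and 1x64K folios; together with the pre-faulted page that is exactly the "4x4K, 1x16K, 1x32K, 1x64K" pattern per 128K described above.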
>
>>
>> Although now that I've actually written that down, it doesn't feel like a
>> very big win. Perhaps Dev can come up with an even more contrived
>> single-page pre-allocation pattern that will maximise the number of PTEs
>> we hit with the current code, and minimise it with the new code :)
>>
>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
>>>> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>>>>
>>>> Thanks,
>>>> Ryan
>>
>
> Thanks
> Barry