From mboxrd@z Thu Jan  1 00:00:00 1970
Message-ID: <58f91a56-890a-45d0-8b1f-47c4c70c9600@arm.com>
Date: Tue, 17 Sep 2024 11:19:21 +0100
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
From: Ryan Roberts
To: Barry Song
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com,
 anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com, gshan@redhat.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com>
 <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
On 17/09/2024 10:09, Barry Song wrote:
> On Tue, Sep 17, 2024 at 4:54 PM Ryan Roberts wrote:
>>
>> On 17/09/2024 09:44, Barry Song wrote:
>>> On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
>>>>
>>>> On 17/09/2024 04:55, Dev Jain wrote:
>>>>>
>>>>> On 9/16/24 18:54, Matthew Wilcox wrote:
>>>>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
>>>>>>> We use pte_range_none() to determine whether contiguous PTEs are empty
>>>>>>> for an mTHP allocation.
>>>>>>> Instead of iterating the while loop for every
>>>>>>> order, use some information, namely the first set PTE found, from the
>>>>>>> previous iteration to eliminate some cases. The key to understanding
>>>>>>> the correctness of the patch is that the ranges we want to examine
>>>>>>> form a strictly decreasing sequence of nested intervals.
>>>>>>
>>>>>> This is a lot more complicated. Do you have any numbers that indicate
>>>>>> that it's faster? Yes, it's fewer memory references, but you've gone
>>>>>> from a simple linear scan that's easy to prefetch to an exponential
>>>>>> scan that might confuse the prefetchers.
>>>>>
>>>>> I do have some numbers. I tested with a simple program and also used
>>>>> the ktime API; with the latter, enclosing everything from
>>>>> "order = highest_order(orders)" to "pte_unmap(pte)" (the entire while
>>>>> loop), a rough estimate is that without the patch it takes 1700 ns to
>>>>> execute, and with the patch it takes 80 - 100 ns less on average. I
>>>>> cannot think of a good testing program...
>>>>>
>>>>> For the prefetching thingy, I am still doing a linear scan, and in each
>>>>> iteration, with the patch, the range I am scanning is going to strictly
>>>>> lie inside the range I would have scanned without the patch. Won't the
>>>>> compiler and the CPU still do prefetching, just on a smaller range?
>>>>> Where does the prefetcher get confused? I confess, I do not understand
>>>>> this very well.
>>>>
>>>> A little history on this: my original "RFC v2" for mTHP included this
>>>> optimization [1], but Yu Zhou suggested dropping it to keep things
>>>> simple, which I did. Then at v8, DavidH suggested we could benefit from
>>>> this sort of optimization, but we agreed to do it later as a separate
>>>> change [2]:
>>>>
>>>> """
>>>>>> Comment: Likely it would make sense to scan only once and determine
>>>>>> the "largest none range" around that address, having the largest
>>>>>> suitable order in mind.
>>>>>
>>>>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to
>>>>> this, IIRC. Perhaps this is an optimization opportunity for later?
>>>>
>>>> Yes, definitely.
>>>> """
>>>>
>>>> Dev independently discovered this opportunity while reading the code,
>>>> and I pointed him to the history and suggested it would likely be
>>>> worthwhile to send a patch.
>>>>
>>>> My view is that I don't see how this can harm performance; in the
>>>> common case, when a single order is enabled, this is essentially the
>>>> same as before. But when there are multiple orders enabled, we are now
>>>> just doing a single linear scan of the ptes rather than multiple scans.
>>>> There will likely be some stack accesses interleaved, but I'd be
>>>> gobsmacked if the prefetchers can't tell the difference between the
>>>> stack and other areas of memory.
>>>>
>>>> Perhaps some perf numbers would help; I think the simplest way to
>>>> gather some numbers would be to create a microbenchmark to allocate a
>>>> large VMA, then fault in single pages at a given stride (say, 1 every
>>>> 128K), then enable 1M, 512K, 256K, 128K and 64K mTHP, then memset the
>>>> entire VMA. It's a bit contrived, but this patch will show improvement
>>>> if the scan is currently a significant portion of the page fault.
>>>>
>>>> If the proposed benchmark shows an improvement, and we don't see any
>>>> regression when only enabling 64K, then my vote would be to accept the
>>>> patch.
>>>
>>> Agreed. The challenge now is how to benchmark this. In a system without
>>> fragmentation, we consistently succeed in allocating the largest size
>>> (1MB). Therefore, we need an environment where allocations of various
>>> sizes can fail proportionally, allowing pte_range_none() to fail on
>>> larger sizes but succeed on smaller ones.
>>
>> I don't think this is about allocation failure; it's about finding a
>> folio order that fits into the VMA without overlapping any already
>> non-none PTEs.
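The difference between the two scans can be sketched with a small userspace model (Python below; a boolean list stands in for the PTEs, the align-down-and-scan loop approximates what alloc_anon_folio() does, and the function names and check counters are illustrative, not kernel API):

```python
# Toy model of the pte_range_none() loop. ptes[i] == True means "PTE i is
# already present"; orders are powers of two in 4K pages (4 = 64K, 8 = 1M).

def old_scan(ptes, fault, orders):
    """Current code: a fresh pte_range_none() scan for every enabled order."""
    checks = 0
    for o in sorted(orders, reverse=True):
        start = fault & ~((1 << o) - 1)      # align fault down to this order
        end = start + (1 << o)
        i = start
        while i < end and not ptes[i]:
            checks += 1
            i += 1
        if i >= end:
            return o, checks                 # whole range none: use this order
        checks += 1                          # the check that hit a present PTE
    return -1, checks

def new_scan(ptes, fault, orders):
    """Patched idea: the ranges are nested, so reuse what the previous
    (larger) scan already learned instead of rescanning."""
    checks = 0
    none_upto = 0        # PTEs below this index in the outer range are none
    present_at = -1      # index of the present PTE that ended the last scan
    for o in sorted(orders, reverse=True):
        start = fault & ~((1 << o) - 1)
        end = start + (1 << o)
        if start <= present_at < end:
            continue                         # known present PTE: skip order
        i = max(start, none_upto)            # never re-check a known-none PTE
        while i < end and not ptes[i]:
            checks += 1
            i += 1
        if i >= end:
            return o, checks
        checks += 1
        present_at, none_upto = i, i

    return -1, checks

# One present PTE every 32 pages (at offset 31), fault at page 0, with
# orders {1M, 512K, 256K, 128K, 64K} enabled:
ptes = [i % 32 == 31 for i in range(256)]
print(old_scan(list(ptes), 0, [8, 7, 6, 5, 4]))   # (4, 144)
print(new_scan(list(ptes), 0, [8, 7, 6, 5, 4]))   # (4, 32)
```

With that pre-fault pattern both versions settle on the 64K order, but the old flow touches 4 * 32 + 16 = 144 entries while the new one touches 32, matching the 144-vs-32 arithmetic discussed in the thread.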
>>
>>> It seems we can't micro-benchmark this with a small program.
>>
>> My proposal was to deliberately fault in a single (4K) page every 128K.
>> That will cause the scanning logic to reduce the order to the next lowest
>> enabled order and try again. So with the current code, for each of the
>> orders {1M, 512K, 256K, 128K} you would scan the first 128K of ptes (32
>> entries), then for 64K you would scan 16 entries: 4 * 32 + 16 = 144
>> entries. With the new change, you would just scan 32 entries.
>
> I'm a bit confused. If we have a VMA from 1GB to 1GB+4MB, and even if you
> fault in a single 4KB page every 128KB with 1MB enabled, you'd still
> succeed in allocating 1MB. For the range 1GB+128KB to 1GB+1MB, wouldn't
> there be no page faults? So, you'd still end up scanning 256 entries
> (1MB/4KB)?

Sorry, I'm not following this.

- start with all mTHP orders disabled.
- mmap a 1G region, which is 1G aligned.
- write a single byte every 128K throughout the VMA.
  - causes 1 4K page to be mapped every 32 pages;
  - 1x4K-present, 31x4K-none, 1x4K-present, 31x4K-none, ...
- enable mTHP orders {1M, 512K, 256K, 128K, 64K, 32K, 16K}
- madvise(MADV_HUGEPAGE)
- write a single byte every 4K throughout the VMA.
  - causes the biggest possible mTHP orders to be allocated in the 31x4K holes
  - 4x4K, 1x16K, 1x32K, 1x64K, 4x4K, 1x16K, 1x32K, 1x64K, ...

Perhaps I didn't make it clear that mTHP would be disabled during the 4K
"pre-faulting" phase, then enabled for the "post-faulting" phase?

> For the entire 4MB VMA, we only have 4 page faults? For each page, we scan
> 256 entries, and there's no way to scan the next possible larger order,
> like 512KB if 1MB has succeeded?
>
> My point is that we need to make the 1MB allocation fail in order to
> disrupt the continuity of pte_none(); otherwise, pte_range_none() will
> return true for the largest order.

But we can simulate that by putting single 4K entries strategically in the
pgtable.
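That expected per-128K layout can be sanity-checked with a toy model of a single window (Python; a boolean list stands in for one 32-page window of PTEs, and the greedy "largest aligned all-none order" rule plus the `fill_window` helper are simplifications for illustration, not kernel code):

```python
# Model one 128K window after pre-faulting: PTE 0 is present, PTEs 1..31 are
# none. Orders are in 4K pages (2 = 16K, 3 = 32K, 4 = 64K). Fault every 4K
# and take the largest enabled order whose naturally aligned range is still
# entirely none, falling back to a single 4K page (order 0).

def fill_window(ptes, orders):
    allocs = []                        # orders actually allocated, in order
    n = len(ptes)
    for i in range(n):                 # simulate writing one byte per page
        if ptes[i]:
            continue                   # already mapped: no fault
        for o in sorted(orders, reverse=True) + [0]:
            start = i & ~((1 << o) - 1)
            end = start + (1 << o)
            if end <= n and not any(ptes[start:end]):
                for j in range(start, end):
                    ptes[j] = True     # map the whole folio
                allocs.append(o)
                break
    return allocs

window = [i == 0 for i in range(32)]   # PTE 0 pre-faulted, 31-page hole
print(fill_window(window, [8, 7, 6, 5, 4, 3, 2]))   # [0, 0, 0, 2, 3, 4]
```

The hole gets filled with 3x4K, 1x16K, 1x32K and 1x64K folios; together with the pre-faulted page that is exactly the "4x4K, 1x16K, 1x32K, 1x64K" pattern per 128K described above.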
>
>>
>> Although now that I've actually written that down, it doesn't feel like a
>> very big win. Perhaps Dev can come up with an even more contrived
>> single-page pre-allocation pattern that will maximise the number of PTEs
>> we hit with the current code, and minimise it with the new code :)
>>
>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
>>>> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>>>>
>>>> Thanks,
>>>> Ryan
>>
>
> Thanks
> Barry