From: Ryan Roberts <ryan.roberts@arm.com>
Date: Tue, 17 Sep 2024 09:54:45 +0100
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
To: Barry Song
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com,
 anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com, gshan@redhat.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com>
 <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>
On 17/09/2024 09:44, Barry Song wrote:
> On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
>>
>> On 17/09/2024 04:55, Dev Jain wrote:
>>>
>>> On 9/16/24 18:54, Matthew Wilcox wrote:
>>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
>>>>> We use pte_range_none() to determine whether contiguous PTEs are empty
>>>>> for an mTHP allocation. Instead of iterating the while loop from
>>>>> scratch for every order, reuse the first set PTE found in the previous
>>>>> iteration to eliminate some cases.
>>>>> The key to understanding the correctness of the patch is that the
>>>>> ranges we want to examine form a strictly decreasing sequence of
>>>>> nested intervals.
>>>>
>>>> This is a lot more complicated. Do you have any numbers that indicate
>>>> that it's faster? Yes, it's fewer memory references, but you've gone
>>>> from a simple linear scan that's easy to prefetch to an exponential
>>>> scan that might confuse the prefetchers.
>>>
>>> I do have some numbers. I tested with a simple program and also used the
>>> ktime API; with the latter, enclosing from "order = highest_order(orders)"
>>> to "pte_unmap(pte)" (the entire while loop), a rough average estimate is
>>> that without the patch it takes 1700 ns to execute, and with the patch it
>>> takes 80-100 ns less on average. I cannot think of a good testing
>>> program...
>>>
>>> For the prefetching thingy, I am still doing a linear scan, and in each
>>> iteration, with the patch, the range I am scanning is going to strictly
>>> lie inside the range I would have scanned without the patch. Won't the
>>> compiler and the CPU still do prefetching, but on a smaller range? Where
>>> does the prefetcher get confused? I confess, I do not understand this
>>> very well.
>>>
>>
>> A little history on this: my original "RFC v2" for mTHP included this
>> optimization [1], but Yu Zhou suggested dropping it to keep things
>> simple, which I did. Then at v8, DavidH suggested we could benefit from
>> this sort of optimization, but we agreed to do it later as a separate
>> change [2]:
>>
>> """
>>>> Comment: Likely it would make sense to scan only once and determine
>>>> the "largest none range" around that address, having the largest
>>>> suitable order in mind.
>>>
>>> Yes, that's how I used to do it, but Yu Zhou requested simplifying to
>>> this, IIRC. Perhaps this is an optimization opportunity for later?
>>
>> Yes, definitely.
>> """
>>
>> Dev independently discovered this opportunity while reading the code,
>> and I pointed him to the history and suggested it would likely be
>> worthwhile to send a patch.
>>
>> My view is that I don't see how this can harm performance; in the common
>> case, when a single order is enabled, this is essentially the same as
>> before. But when there are multiple orders enabled, we are now just
>> doing a single linear scan of the ptes rather than multiple scans. There
>> will likely be some stack accesses interleaved, but I'd be gobsmacked if
>> the prefetchers can't tell the difference between the stack and other
>> areas of memory.
>>
>> Perhaps some perf numbers would help; I think the simplest way to gather
>> some numbers would be to create a microbenchmark to allocate a large
>> VMA, then fault in single pages at a given stride (say, 1 every 128K),
>> then enable 1M, 512K, 256K, 128K and 64K mTHP, then memset the entire
>> VMA. It's a bit contrived, but this patch will show improvement if the
>> scan is currently a significant portion of the page fault.
>>
>> If the proposed benchmark shows an improvement, and we don't see any
>> regression when only enabling 64K, then my vote would be to accept the
>> patch.
>
> Agreed. The challenge now is how to benchmark this. In a system without
> fragmentation, we consistently succeed in allocating the largest size
> (1MB). Therefore, we need an environment where allocations of various
> sizes can fail proportionally, allowing pte_range_none() to fail on
> larger sizes but succeed on smaller ones.

I don't think this is about allocation failure? It's about finding a folio
order that fits into the VMA without overlapping any already non-none PTEs.

>
> It seems we can't micro-benchmark this with a small program.

My proposal was to deliberately fault in a single (4K) page every 128K. That
will cause the scanning logic to reduce the order to the next lowest enabled
order and try again.
So with the current code, for all orders {1M, 512K, 256K, 128K} you would
scan the first 128K of ptes (32 entries), then for 64K you would scan 16
entries: 4 * 32 + 16 = 144 entries. With the new change, you would just scan
32 entries.

Although now that I've actually written that down, it doesn't feel like a
very big win. Perhaps Dev can come up with an even more contrived
single-page pre-allocation pattern that will maximise the number of PTEs we
hit with the current code, and minimise it with the new code :)

>
>>
>> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
>> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>>
>> Thanks,
>> Ryan