From: Barry Song <21cnbao@gmail.com>
Date: Tue, 17 Sep 2024 17:09:08 +0800
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
To: Ryan Roberts
Cc: Dev Jain, Matthew Wilcox, akpm@linux-foundation.org, david@redhat.com,
 anshuman.khandual@arm.com, hughd@google.com, ioworker0@gmail.com,
 wangkefeng.wang@huawei.com, baolin.wang@linux.alibaba.com,
 gshan@redhat.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
References: <20240913091902.1160520-1-dev.jain@arm.com>
 <091f517d-e7dc-4c10-b1ac-39658f31f0ed@arm.com>

On Tue, Sep 17, 2024 at 4:54 PM Ryan Roberts wrote:
>
> On 17/09/2024 09:44, Barry Song wrote:
> > On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts wrote:
> >>
> >> On 17/09/2024 04:55, Dev Jain wrote:
> >>>
> >>> On 9/16/24 18:54, Matthew Wilcox wrote:
> >>>> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>>>> We use pte_range_none() to determine whether contiguous PTEs are
> >>>>> empty for an mTHP allocation. Instead of iterating the while loop
> >>>>> for every order, reuse information from the previous iteration
> >>>>> (the first set PTE found) to eliminate some cases without
> >>>>> rescanning. The key to understanding the correctness of the patch
> >>>>> is that the ranges we want to examine form a strictly decreasing
> >>>>> sequence of nested intervals.
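For illustration, a minimal sketch of the idea being described, not the
actual patch: it assumes a hypothetical helper pte_range_none_first()
that returns the address of the first non-none PTE in a range (or 0 if
the range is entirely empty), and the surrounding variables follow the
real loop in alloc_anon_folio() in mm/memory.c.

/* Sketch only; pte_range_none_first() is a hypothetical helper. */
unsigned long first_set = 0;	/* first non-none PTE found so far */

order = highest_order(orders);
while (orders) {
	addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);

	/*
	 * Every candidate range contains vmf->address, so each
	 * smaller range is nested inside the previous, larger one.
	 */
	if (first_set && first_set >= addr + (PAGE_SIZE << order)) {
		/*
		 * The earlier, larger scan walked past the end of this
		 * range before hitting a set PTE, so this range is
		 * already known to be empty: use this order.
		 */
		break;
	}
	if (!first_set || first_set < addr) {
		/* Range not covered by a previous scan: scan it. */
		first_set = pte_range_none_first(pte, addr, 1 << order);
		if (!first_set)
			break;	/* range is empty: use this order */
	}
	/* first_set lies inside this range, so it cannot be empty. */
	order = next_order(&orders, order);
}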
> >>>> This is a lot more complicated. Do you have any numbers that
> >>>> indicate that it's faster? Yes, it's fewer memory references, but
> >>>> you've gone from a simple linear scan that's easy to prefetch to
> >>>> an exponential scan that might confuse the prefetchers.
> >>>
> >>> I do have some numbers. I tested with a simple program, and also
> >>> used the ktime API; with the latter, enclosing everything from
> >>> "order = highest_order(orders)" to "pte_unmap(pte)" (i.e. the
> >>> entire while loop), a rough average estimate is that without the
> >>> patch it takes 1700 ns to execute, and with the patch it takes on
> >>> average 80-100 ns less. I cannot think of a good testing
> >>> program...
> >>>
> >>> As for the prefetching, I am still doing a linear scan, and in
> >>> each iteration, with the patch, the range I am scanning will
> >>> strictly lie inside the range I would have scanned without the
> >>> patch. Won't the compiler and the CPU still do prefetching, just
> >>> on a smaller range? Where does the prefetcher get confused? I
> >>> confess, I do not understand this very well.
> >>>
> >>
> >> A little history on this: my original "RFC v2" for mTHP included
> >> this optimization [1], but Yu Zhou suggested dropping it to keep
> >> things simple, which I did. Then at v8, DavidH suggested we could
> >> benefit from this sort of optimization, but we agreed to do it
> >> later as a separate change [2]:
> >>
> >> """
> >>>> Comment: Likely it would make sense to scan only once and
> >>>> determine the "largest none range" around that address, having
> >>>> the largest suitable order in mind.
> >>>
> >>> Yes, that's how I used to do it, but Yu Zhou requested simplifying
> >>> to this, IIRC. Perhaps this is an optimization opportunity for
> >>> later?
> >>
> >> Yes, definitely.
> >> """
> >>
> >> Dev independently discovered this opportunity while reading the
> >> code, so I pointed him to the history and suggested it would likely
> >> be worthwhile to send a patch.
> >>
> >> My view is that I don't see how this can harm performance; in the
> >> common case, when a single order is enabled, this is essentially
> >> the same as before. But when multiple orders are enabled, we are
> >> now doing a single linear scan of the ptes rather than multiple
> >> scans. There will likely be some stack accesses interleaved, but
> >> I'd be gobsmacked if the prefetchers can't tell the difference
> >> between the stack and other areas of memory.
> >>
> >> Perhaps some perf numbers would help; I think the simplest way to
> >> gather some would be to create a microbenchmark that allocates a
> >> large VMA, faults in single pages at a given stride (say, 1 every
> >> 128K), then enables 1M, 512K, 256K, 128K and 64K mTHP, then memsets
> >> the entire VMA. It's a bit contrived, but this patch will show
> >> improvement if the scan is currently a significant portion of the
> >> page fault cost.
> >>
> >> If the proposed benchmark shows an improvement, and we don't see
> >> any regression when only enabling 64K, then my vote would be to
> >> accept the patch.
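For concreteness, a rough sketch of such a microbenchmark (an
illustration, not code from the thread; the VMA size and stride are
arbitrary, and the mTHP sizes would be enabled between the pre-fault
pass and the memset, e.g. by writing "always" to
/sys/kernel/mm/transparent_hugepage/hugepages-<size>kB/enabled):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE (1024UL * 1024 * 1024)	/* 1G anonymous VMA */
#define STRIDE   (128UL * 1024)		/* one 4K fault per 128K */

int main(void)
{
	struct timespec t0, t1;
	char *buf = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	/* Pre-fault one page at every stride boundary (4K only). */
	for (unsigned long off = 0; off < VMA_SIZE; off += STRIDE)
		buf[off] = 1;

	/* ... enable the mTHP sizes under test here, via sysfs ... */

	/* Fault in the gaps; this is where the PTE scan runs. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	memset(buf, 1, VMA_SIZE);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("memset: %.3f ms\n",
	       (t1.tv_sec - t0.tv_sec) * 1e3 +
	       (t1.tv_nsec - t0.tv_nsec) / 1e6);
	return 0;
}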
> > Agreed. The challenge now is how to benchmark this. In a system
> > without fragmentation, we consistently succeed in allocating the
> > largest size (1MB). Therefore, we need an environment where
> > allocations of various sizes can fail proportionally, allowing
> > pte_range_none() to fail on larger sizes but succeed on smaller
> > ones.
>
> I don't think this is about allocation failure? It's about finding a
> folio order that fits into the VMA without overlapping any already
> non-none PTEs.
>
> > It seems we can't micro-benchmark this with a small program.
>
> My proposal was to deliberately fault in a single (4K) page every
> 128K. That will cause the scanning logic to reduce the order to the
> next lowest enabled order and try again. So with the current code,
> for all orders {1M, 512K, 256K, 128K} you would scan the first 128K
> of ptes (32 entries), then for 64K you would scan 16 entries:
> 4 * 32 + 16 = 144 entries. With the new change, you would just scan
> 32 entries.

I'm a bit confused. If we have a VMA from 1GB to 1GB+4MB, then even if
you fault in a single 4KB page every 128KB with 1MB enabled, you'd
still succeed in allocating 1MB. For the range 1GB+128KB to 1GB+1MB,
wouldn't there be no page faults? So you'd still end up scanning 256
entries (1MB/4KB)?

For the entire 4MB VMA, we would only have 4 page faults, right? For
each fault, we scan 256 entries, and there's no way to scan the next
possible larger order, like 512KB, if 1MB has succeeded.

My point is that we need to make the 1MB allocation fail in order to
disrupt the continuity of pte_none(). Otherwise, pte_range_none() will
return true for the largest order.

> Although now that I've actually written that down, it doesn't feel
> like a very big win. Perhaps Dev can come up with an even more
> contrived single-page pre-allocation pattern that will maximise the
> number of PTEs we hit with the current code, and minimise it with the
> new code :)
>
> >> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
> >> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
> >>
> >> Thanks,
> >> Ryan

Thanks
Barry