From: Barry Song <baohua@kernel.org>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>, Matthew Wilcox <willy@infradead.org>,
akpm@linux-foundation.org, david@redhat.com,
anshuman.khandual@arm.com, hughd@google.com,
ioworker0@gmail.com, wangkefeng.wang@huawei.com,
baolin.wang@linux.alibaba.com, gshan@redhat.com,
linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
Date: Tue, 17 Sep 2024 16:44:30 +0800 [thread overview]
Message-ID: <CAGsJ_4ysXDAzODPkmeuaEAJnXFofxdVYvsdNcyN-xjER+rjMzQ@mail.gmail.com> (raw)
In-Reply-To: <bde86cb8-fe30-4747-b3a7-cc40d0850f10@arm.com>
On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 17/09/2024 04:55, Dev Jain wrote:
> >
> > On 9/16/24 18:54, Matthew Wilcox wrote:
> >> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>> We use pte_range_none() to determine whether contiguous PTEs are empty
> >>> for an mTHP allocation. Instead of iterating the while loop from scratch
> >>> for every order, reuse the position of the first set PTE found in the
> >>> previous iteration to eliminate some cases. The key to understanding
> >>> the correctness of the patch is that the ranges we want to examine
> >>> form a strictly decreasing sequence of nested intervals.
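For reference, the idea can be sketched in userspace as follows. This is a simplified model, not the kernel patch itself: `slots` stands in for the PTE range, `t` for the fault index, and the model rescans overlapping regions at smaller orders, whereas the actual patch also resumes from prior progress. It illustrates the nested-intervals property: an occupied slot found while checking a larger window also disqualifies every smaller window that still contains it.

```c
/*
 * Simplified userspace model of the optimization under discussion
 * (an assumption, not the kernel code): pick the largest order in
 * [min_order, max_order] whose naturally-aligned window around a
 * fault index t is entirely empty.  The windows for decreasing
 * orders are nested, so remembering the first occupied slot seen at
 * a larger order lets us reject smaller orders without rescanning.
 */
#include <assert.h>
#include <stddef.h>

static size_t align_down(size_t x, size_t a)
{
	return x & ~(a - 1);
}

/* Return index of first occupied slot in [start, end), or -1. */
static long first_occupied(const int *slots, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		if (slots[i])
			return (long)i;
	return -1;
}

int pick_order(const int *slots, size_t t, int max_order, int min_order)
{
	long busy = -1;		/* first occupied index seen so far, if any */

	for (int order = max_order; order >= min_order; order--) {
		size_t start = align_down(t, (size_t)1 << order);
		size_t end = start + ((size_t)1 << order);

		if (busy >= 0 && (size_t)busy >= start && (size_t)busy < end)
			continue;	/* known-occupied slot: skip the scan */

		long f = first_occupied(slots, start, end);
		if (f < 0)
			return order;
		busy = f;
	}
	return -1;
}
```

With 16 slots, slot 1 occupied, and t = 6, the order-3 window [0, 8) is rejected without a scan because it contains the remembered index 1, and the order-2 window [4, 8) succeeds.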
> >> This is a lot more complicated. Do you have any numbers that indicate
> >> that it's faster? Yes, it's fewer memory references, but you've gone
> >> from a simple linear scan that's easy to prefetch to an exponential scan
> >> that might confuse the prefetchers.
> >
> > I do have some numbers. I tested with a simple program, and also with the
> > ktime API: enclosing the region from "order = highest_order(orders)"
> > to "pte_unmap(pte)" (i.e. the entire while loop), a rough average
> > estimate is that without the patch it takes 1700 ns to execute; with the
> > patch it takes 80 - 100 ns less on average. I cannot think of a good
> > testing program...
> >
> > For the prefetching thingy, I am still doing a linear scan, and in each
> > iteration, with the patch, the range I am scanning is going to strictly
> > lie inside the range I would have scanned without the patch. Won't the
> > compiler and the CPU still do prefetching, but on a smaller range; where
> > does the prefetcher get confused? I confess, I do not understand this
> > very well.
> >
>
> A little history on this: my original "RFC v2" for mTHP included this
> optimization [1], but Yu Zhao suggested dropping it to keep things simple, which
> I did. Then at v8, DavidH suggested we could benefit from this sort of
> optimization, but we agreed to do it later as a separate change [2]:
>
> """
> >> Comment: Likely it would make sense to scan only once and determine the
> >> "largest none range" around that address, having the largest suitable order
> >> in mind.
> >
> > Yes, that's how I used to do it, but Yu Zhao requested simplifying to this,
> > IIRC. Perhaps this is an optimization opportunity for later?
>
> Yes, definitely.
> """
>
> Dev independently discovered this opportunity while reading the code, and I
> pointed him to the history, and suggested it would likely be worthwhile to send
> a patch.
>
> My view is that I don't see how this can harm performance; in the common case,
> when a single order is enabled, this is essentially the same as before. But when
> there are multiple orders enabled, we are now just doing a single linear scan of
> the ptes rather than multiple scans. There will likely be some stack accesses
> interleaved, but I'd be gobsmacked if the prefetchers can't tell the difference
> between the stack and other areas of memory.
>
> Perhaps some perf numbers would help; I think the simplest way to gather some
> numbers would be to create a microbenchmark to allocate a large VMA, then fault
> in single pages at a given stride (say, 1 every 128K), then enable 1M, 512K,
> 256K, 128K and 64K mTHP, then memset the entire VMA. It's a bit contrived, but
> this patch will show improvement if the scan is currently a significant portion
> of the page fault.
>
> If the proposed benchmark shows an improvement, and we don't see any regression
> when only enabling 64K, then my vote would be to accept the patch.
Agreed. The challenge now is how to benchmark this. In a system without
fragmentation, we consistently succeed in allocating the largest size (1MB).
Therefore, we need an environment where allocations of various sizes can fail
proportionally, allowing pte_range_none() to fail on larger sizes but succeed
on smaller ones. It seems we can't micro-benchmark this with a small program.
>
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
> [2] https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>
> Thanks,
> Ryan
Thread overview: 14+ messages
2024-09-13 9:19 Dev Jain
2024-09-16 5:12 ` Barry Song
2024-09-16 5:20 ` Dev Jain
2024-09-16 5:58 ` Barry Song
2024-09-16 13:24 ` Matthew Wilcox
2024-09-17 3:35 ` Lance Yang
2024-09-17 5:35 ` Barry Song
2024-09-17 3:55 ` Dev Jain
2024-09-17 8:29 ` Ryan Roberts
2024-09-17 8:44 ` Barry Song [this message]
2024-09-17 8:54 ` Ryan Roberts
2024-09-17 9:09 ` Barry Song
2024-09-17 10:19 ` Ryan Roberts
2024-09-17 10:12 ` David Hildenbrand