From: Barry Song <baohua@kernel.org>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Dev Jain <dev.jain@arm.com>, Matthew Wilcox <willy@infradead.org>,
	akpm@linux-foundation.org, david@redhat.com,
	anshuman.khandual@arm.com, hughd@google.com,
	ioworker0@gmail.com, wangkefeng.wang@huawei.com,
	baolin.wang@linux.alibaba.com, gshan@redhat.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: Compute mTHP order efficiently
Date: Tue, 17 Sep 2024 16:44:30 +0800	[thread overview]
Message-ID: <CAGsJ_4ysXDAzODPkmeuaEAJnXFofxdVYvsdNcyN-xjER+rjMzQ@mail.gmail.com> (raw)
In-Reply-To: <bde86cb8-fe30-4747-b3a7-cc40d0850f10@arm.com>

On Tue, Sep 17, 2024 at 4:29 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 17/09/2024 04:55, Dev Jain wrote:
> >
> > On 9/16/24 18:54, Matthew Wilcox wrote:
> >> On Fri, Sep 13, 2024 at 02:49:02PM +0530, Dev Jain wrote:
> >>> We use pte_range_none() to determine whether a contiguous range of PTEs
> >>> is empty for an mTHP allocation. Instead of rescanning the PTEs from
> >>> scratch on every iteration of the while loop over the orders, reuse
> >>> information from the previous iteration, namely the first set PTE that
> >>> was found, to rule out ranges that cannot be empty. The key to the
> >>> correctness of the patch is that the ranges we want to examine form a
> >>> strictly decreasing sequence of nested intervals.
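
To make the shape of the optimization concrete, here is a tiny user-space
model of the idea. A plain array stands in for the block of PTEs, the
ranges are simplified to prefixes starting at index 0 (the real ranges are
aligned around the fault address, so the actual patch needs a little more
bookkeeping), and all names are made up; it is only an illustration, not
the kernel code:

/* Illustrative model only, not the kernel code: present[i] == true
 * means pte i in the block is already populated.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_PTES 256 /* 1M worth of 4K ptes */

/* Baseline: rescan the range from scratch for every order. */
static int scan_rescan(const bool *present, const int *orders, int n)
{
    for (int o = 0; o < n; o++) {
        int nr = 1 << orders[o], i;

        for (i = 0; i < nr; i++)
            if (present[i])
                break;
        if (i == nr)
            return orders[o]; /* whole range is empty */
    }
    return -1;
}

/*
 * Optimised: remember the first populated entry seen. Any later order
 * whose (smaller, nested) range still contains it cannot succeed, and
 * the first order whose range excludes it is already known to be empty.
 */
static int scan_reuse(const bool *present, const int *orders, int n)
{
    int first_set = -1;

    for (int o = 0; o < n; o++) {
        int nr = 1 << orders[o], i;

        if (first_set >= 0) {
            if (first_set < nr)
                continue;         /* populated pte inside this range */
            return orders[o];     /* range already known to be empty */
        }

        for (i = 0; i < nr; i++)
            if (present[i]) {
                first_set = i;
                break;
            }
        if (i == nr)
            return orders[o];
    }
    return -1;
}

int main(void)
{
    bool present[NR_PTES] = { false };
    /* 1M, 512K, 256K, 128K, 64K expressed as orders of 4K pages */
    int orders[] = { 8, 7, 6, 5, 4 };

    present[40] = true; /* one populated pte spoils the larger orders */

    printf("rescan picks order %d, reuse picks order %d\n",
           scan_rescan(present, orders, 5),
           scan_reuse(present, orders, 5));
    return 0;
}

In this simplified model the entries are walked at most once no matter how
many orders are enabled, instead of once per enabled order, which is the
saving the patch is after.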
> >> This is a lot more complicated.  Do you have any numbers that indicate
> >> that it's faster?  Yes, it's fewer memory references, but you've gone
> >> from a simple linear scan that's easy to prefetch to an exponential scan
> >> that might confuse the prefetchers.
> >
> > I do have some numbers. I tested with a simple program and also used the
> > ktime API; with the latter, timing from "order = highest_order(orders)"
> > to "pte_unmap(pte)" (i.e. enclosing the entire while loop), a rough
> > average estimate is that it takes 1700 ns to execute without the patch,
> > and 80 - 100 ns less with it. I cannot think of a good testing
> > program...
> >
> > As for the prefetching, I am still doing a linear scan, and in each
> > iteration, with the patch, the range I scan lies strictly inside the
> > range I would have scanned without the patch. Won't the compiler and the
> > CPU still prefetch, just over a smaller range? Where does the prefetcher
> > get confused? I confess I do not understand this very well.
> >
>
> A little history on this: my original "RFC v2" for mTHP included this
> optimization [1], but Yu Zhou suggested dropping it to keep things simple,
> which I did. Then at v8, DavidH suggested we could benefit from this sort of
> optimization, but we agreed to do it later as a separate change [2]:
>
> """
> >> Comment: Likely it would make sense to scan only once and determine the
> >> "largest none range" around that address, having the largest suitable order
> >> in mind.
> >
> > Yes, that's how I used to do it, but Yu Zhou requested simplifying to this,
> > IIRC. Perhaps this is an optimization opportunity for later?
>
> Yes, definitely.
> """
>
> Dev independently discovered this opportunity while reading the code; I
> pointed him to the history and suggested it would likely be worthwhile to
> send a patch.
>
> My view is that I don't see how this can harm performance; in the common case,
> when a single order is enabled, this is essentially the same as before. But when
> there are multiple orders enabled, we are now just doing a single linear scan of
> the ptes rather than multiple scans. There will likely be some stack accesses
> interleaved, but I'd be gobsmacked if the prefetchers can't tell the difference
> between the stack and other areas of memory.
>
> Perhaps some perf numbers would help; I think the simplest way to gather some
> numbers would be to create a microbenchmark to allocate a large VMA, then fault
> in single pages at a given stride (say, 1 every 128K), then enable 1M, 512K,
> 256K, 128K and 64K mTHP, then memset the entire VMA. It's a bit contrived, but
> this patch should show an improvement if the scan is currently a significant
> portion of the page-fault cost.
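
For reference, an untested sketch of such a microbenchmark (the 1G size,
the 128K stride and timing around the memset are arbitrary choices, and
the relevant mTHP sizes have to be enabled first via the per-size sysfs
controls, e.g. /sys/kernel/mm/transparent_hugepage/hugepages-64kB/enabled,
on kernels that expose them):

/* Untested sketch of the benchmark described above. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#define VMA_SIZE (1UL << 30)   /* 1G anonymous VMA */
#define STRIDE   (128UL << 10) /* pre-fault one page every 128K */

static double now(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    char *p = mmap(NULL, VMA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Fault in one page per 128K so the larger pte ranges are never empty. */
    for (size_t off = 0; off < VMA_SIZE; off += STRIDE)
        p[off] = 1;

    /* The memset faults in everything else; the per-order scan runs here. */
    double t0 = now();
    memset(p, 2, VMA_SIZE);
    printf("memset took %.3f s\n", now() - t0);

    munmap(p, VMA_SIZE);
    return 0;
}

The absolute time means little on its own; the interesting comparison is
the same run with and without the patch, and with different sets of sizes
enabled.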
>
> If the proposed benchmark shows an improvement, and we don't see any regression
> when only enabling 64K, then my vote would be to accept the patch.

Agreed. The challenge now is how to benchmark this. In a system without
fragmentation, we consistently succeed in allocating the largest size
(1MB). Therefore, we need an environment where allocations of various
sizes can fail proportionally, allowing pte_range_none() to fail on
larger sizes but succeed on smaller ones.

It seems we can't micro-benchmark this with a small program.

>
> [1] https://lore.kernel.org/linux-mm/20230414130303.2345383-7-ryan.roberts@arm.com/
> [2]
> https://lore.kernel.org/linux-mm/ca649aad-7b76-4c6d-b513-26b3d58f8e68@redhat.com/
>
> Thanks,
> Ryan


