From: "Huang, Ying" <ying.huang@intel.com>
To: Bharata B Rao <bharata@amd.com>
Cc: <linux-kernel@vger.kernel.org>, <linux-mm@kvack.org>,
<mgorman@suse.de>, <peterz@infradead.org>, <mingo@redhat.com>,
<bp@alien8.de>, <dave.hansen@linux.intel.com>, <x86@kernel.org>,
<akpm@linux-foundation.org>, <luto@kernel.org>,
<tglx@linutronix.de>, <yue.li@memverge.com>,
<Ravikumar.Bangoria@amd.com>
Subject: Re: [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing
Date: Mon, 13 Feb 2023 14:30:53 +0800 [thread overview]
Message-ID: <87zg9i9iw2.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <72b6ec8b-f141-3807-d7f2-f853b0f0b76c@amd.com> (Bharata B. Rao's message of "Mon, 13 Feb 2023 11:22:12 +0530")
Bharata B Rao <bharata@amd.com> writes:
> On 2/13/2023 8:56 AM, Huang, Ying wrote:
>> Bharata B Rao <bharata@amd.com> writes:
>>
>>> Hi,
>>>
>>> Some hardware platforms can provide information about memory accesses
>>> that can be used to do optimal page and task placement on NUMA
>>> systems. AMD processors have a hardware facility called Instruction-
>>> Based Sampling (IBS) that can be used to gather specific metrics
>>> related to instruction fetch and execution activity. This facility
>>> can be used to perform memory access profiling based on statistical
>>> sampling.
>>>
>>> This RFC is a proof-of-concept implementation where the access
>>> information obtained from the hardware is used to drive NUMA balancing.
>>> With this it is no longer necessary to scan the address space and
>>> introduce NUMA hint faults to build task-to-page association. Hence
>>> the approach taken here is to replace the address space scanning plus
>>> hint faults with the access information provided by the hardware.
>>
>> You method can avoid the address space scanning, but cannot avoid memory
>> access fault in fact. PMU will raise NMI and then task_work to process
>> the sampled memory accesses. The overhead depends on the frequency of
>> the memory access sampling. Please measure the overhead of your method
>> in details.
>
> Yes, the address space scanning is avoided. I will measure the overhead
> of hint fault vs NMI handling path. The actual processing of the access
> from task_work context is pretty much similar to the stats processing
> from hint faults. As you note the overhead depends on the frequency of
> sampling. In this current approach, the sampling period is per-task
> and it varies based on the same logic that NUMA balancing uses to
> vary the scan period.
>
>>
>>> The access samples obtained from hardware are fed to NUMA balancing
>>> as fault-equivalents. The rest of the NUMA balancing logic that
>>> collects/aggregates the shared/private/local/remote faults and does
>>> pages/task migrations based on the faults is retained except that
>>> accesses replace faults.
>>>
>>> This early implementation is an attempt to get a working solution
>>> only and as such a lot of TODOs exist:
>>>
>>> - Perf uses IBS and we are using the same IBS for access profiling here.
>>> There needs to be a proper way to make the use mutually exclusive.
>>> - Is tying this up with NUMA balancing a reasonable approach or
>>> should we look at a completely new approach?
>>> - When accesses replace faults in NUMA balancing, a few things have
>>> to be tuned differently. All such decision points need to be
>>> identified and appropriate tuning needs to be done.
>>> - Hardware provided access information could be very useful for driving
>>> hot page promotion in tiered memory systems. Need to check if this
>>> requires different tuning/heuristics apart from what NUMA balancing
>>> already does.
>>> - Some of the values used to program the IBS counters like the sampling
>>> period etc may not be the optimal or ideal values. The sample period
>>> adjustment follows the same logic as scan period modification which
>>> may not be ideal. More experimentation is required to fine-tune all
>>> these aspects.
>>> - Currently I am acting (i,e., attempt to migrate a page) on each sampled
>>> access. Need to check if it makes sense to delay it and do batched page
>>> migration.
>>
>> You current implementation is tied with AMD IBS. You will need a
>> architecture/vendor independent framework for upstreaming.
>
> I have tried to keep it vendor and arch neutral as far
> as possible, will re-look into this of course to make the
> interfaces more robust and useful.
>
> I have defined a static key (hw_access_hints=false) which will be
> set only by the platform driver when it detects the hardware
> capability to provide memory access information. NUMA balancing
> code skips the address space scanning when it sees this capability.
> The platform driver (access fault handler) will call into the NUMA
> balancing API with linear and physical address information of the
> accessed sample. Hence any equivalent hardware functionality could
> plug into this scheme in its current form. There are checks for this
> static key in the NUMA balancing logic at a few points to decide if
> it should work based on access faults or hint faults.
>
>>
>> BTW: can IBS sampling memory writing too? Or just memory reading?
>
> IBS can tag both store and load operations.
Thanks for your information!
>>
>>> This RFC is mainly about showing how hardware provided access
>>> information could be used for NUMA balancing but I have run a
>>> few basic benchmarks from mmtests to check if this is any severe
>>> regression/overhead to any of those. Some benchmarks show some
>>> improvement, some show no significant change and a few regress.
>>> I am hopeful that with more appropriate tuning there is scope for
>>> futher improvement here especially for workloads for which NUMA
>>> matters.
>>
>> What's your expected improvement of the PMU based NUMA balancing? It
>> should come from reduced overhead? higher accuracy? Quicker response?
>> I think that it may be better to prove that with appropriate statistics
>> for at least one workload.
>
> Just to clarify, unlike PEBS, IBS works independently of PMU.
Good to known this, Thanks!
> I believe the improvement will come from reduced overhead due to
> sampling of relevant accesses only.
>
> I have a microbenchmark where two sets of threads bound to two
> NUMA nodes access the two different halves of memory which is
> initially allocated on the 1st node.
>
> On a two node Zen4 system, with 64 threads in each set accessing
> 8G of memory each from the initial allocation of 16G, I see that
> IBS driven NUMA balancing (i,e., this patchset) takes 50% less time
> to complete a fixed number of memory accesses. This could well
> be the best case and real workloads/benchmarks may not get this much
> uplift, but it does show the potential gain to be had.
Can you find a way to show the overhead of the original implementation
and your method? Then we can compare between them? Because you think
the improvement comes from the reduced overhead.
I also have interest in the pages migration throughput per second during
the test, because I suspect your method can migrate pages faster.
Best Regards,
Huang, Ying
next prev parent reply other threads:[~2023-02-13 6:31 UTC|newest]
Thread overview: 33+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-08 7:35 Bharata B Rao
2023-02-08 7:35 ` [RFC PATCH 1/5] x86/ibs: In-kernel IBS driver for page access profiling Bharata B Rao
2023-02-08 7:35 ` [RFC PATCH 2/5] x86/ibs: Drive NUMA balancing via IBS access data Bharata B Rao
2023-02-08 7:35 ` [RFC PATCH 3/5] x86/ibs: Enable per-process IBS from sched switch path Bharata B Rao
2023-02-08 7:35 ` [RFC PATCH 4/5] x86/ibs: Adjust access faults sampling period Bharata B Rao
2023-02-08 7:35 ` [RFC PATCH 5/5] x86/ibs: Delay the collection of HW-provided access info Bharata B Rao
2023-02-08 18:03 ` [RFC PATCH 0/5] Memory access profiler(IBS) driven NUMA balancing Peter Zijlstra
2023-02-08 18:12 ` Dave Hansen
2023-02-09 6:04 ` Bharata B Rao
2023-02-09 14:28 ` Dave Hansen
2023-02-10 4:28 ` Bharata B Rao
2023-02-10 4:40 ` Dave Hansen
2023-02-10 15:10 ` Bharata B Rao
2023-02-09 5:57 ` Bharata B Rao
2023-02-13 2:56 ` Huang, Ying
2023-02-13 3:23 ` Bharata B Rao
2023-02-13 3:34 ` Huang, Ying
2023-02-13 3:26 ` Huang, Ying
2023-02-13 5:52 ` Bharata B Rao
2023-02-13 6:30 ` Huang, Ying [this message]
2023-02-14 4:55 ` Bharata B Rao
2023-02-15 6:07 ` Huang, Ying
2023-02-24 3:28 ` Bharata B Rao
2023-02-16 8:41 ` Bharata B Rao
2023-02-17 6:03 ` Huang, Ying
2023-02-24 3:36 ` Bharata B Rao
2023-02-27 7:54 ` Huang, Ying
2023-03-01 11:21 ` Bharata B Rao
2023-03-02 8:10 ` Huang, Ying
2023-03-03 5:25 ` Bharata B Rao
2023-03-03 5:53 ` Huang, Ying
2023-03-06 15:30 ` Bharata B Rao
2023-03-07 2:33 ` Huang, Ying
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87zg9i9iw2.fsf@yhuang6-desk2.ccr.corp.intel.com \
--to=ying.huang@intel.com \
--cc=Ravikumar.Bangoria@amd.com \
--cc=akpm@linux-foundation.org \
--cc=bharata@amd.com \
--cc=bp@alien8.de \
--cc=dave.hansen@linux.intel.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=luto@kernel.org \
--cc=mgorman@suse.de \
--cc=mingo@redhat.com \
--cc=peterz@infradead.org \
--cc=tglx@linutronix.de \
--cc=x86@kernel.org \
--cc=yue.li@memverge.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox