Re: [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA

From: "Garg, Shivank" <shivankg@amd.com>
To: Matthew Wilcox <willy@infradead.org>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Cc: akpm@linux-foundation.org, bharata@amd.com,
	raghavendra.kodsarathimmappa@amd.com, Michael.Day@amd.com,
	dmaengine@vger.kernel.org, vkoul@kernel.org
Subject: Re: [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA
Date: Tue, 25 Jun 2024 14:27:50 +0530
Message-ID: <670a9454-7e9b-48ee-a87b-966c90214bc0@amd.com>
In-Reply-To: <c024d035-dc94-4e89-a935-795ab2ce24e7@amd.com>


Hi,

On 6/17/2024 5:10 PM, Garg, Shivank wrote:
> Hi Matthew,
> 
> On 6/15/2024 9:32 AM, Matthew Wilcox wrote:
>> On Sat, Jun 15, 2024 at 03:45:20AM +0530, Shivank Garg wrote:
> 
>>
>> You haven't measured the important thing though -- what's the cost
>> _to userspace_?  When the CPU does the copy, the data is now
>> cache-hot in that CPU's cache.  When the DMA engine does the copy,
>> it's not cache-hot in any CPU.
>>
>> Now, this may not be a big problem.  I don't think we do anything to 
>> ensure that the CPU that is going to access the folio in userspace
>> is the one which does the copy.
>>
>> But your methodology is wrong.
> 
> You're right about the importance of measuring the cost to userspace.
> I initially focused on analyzing the folio_copy overheads within migrate_pages to identify potential optimization opportunities using DMA hardware accelerators.
> 
> To address this, I'm planning to extend my experiments to measure the cost to userspace specifically related to cache-hotness. This will involve accessing the migrated pages after the migration process is complete, and measuring the resulting read/write latency.
> 
> DMA offloading could possibly help in scenarios that involve bulk data copying with workload size >> cache capacity, or that incur a large shootdown overhead.
> 
> The userspace cost analysis will provide a more comprehensive picture of page migration using CPU vs. DMA offloading.
> 
> I appreciate your feedback.



I extended my earlier experiments to page migration from a remote NUMA node
to the local node. This involves measuring the cost to userspace for
different workload sizes (4KB, 2MB, 512MB, and 1GB).
The experiments cover two scenarios: first, smaller workloads (4KB and 2MB)
that fit within the CPU cache; second, larger workloads (512MB and 1GB)
that exceed cache capacity.

move_pages for N pages from src_node=1 to dst_node=0
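
For reference, the migration step can be driven from userspace roughly as
follows. This is a minimal sketch, not the test program used for the numbers
in this mail: it assumes Linux, calls move_pages(2) through syscall(2) to
avoid a libnuma dependency, and the helper names fill_page_ptrs() and
move_to_node() are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_MF_MOVE
#define MPOL_MF_MOVE (1 << 1)	/* same value as in <linux/mempolicy.h> */
#endif

/* Hypothetical helper: pages[i] = address of the i-th page of buf. */
void fill_page_ptrs(void *buf, size_t num_pages, size_t page_size,
		    void **pages)
{
	for (size_t i = 0; i < num_pages; i++)
		pages[i] = (char *)buf + i * page_size;
}

/*
 * Hypothetical helper: ask the kernel to migrate num_pages pages of the
 * calling process (pid 0) to dst_node. Returns the raw move_pages(2)
 * result. On return, status[i] holds the node the page is on, or a
 * negative errno for that page.
 */
long move_to_node(void **pages, size_t num_pages, int dst_node,
		  int *nodes, int *status)
{
	for (size_t i = 0; i < num_pages; i++)
		nodes[i] = dst_node;

	return syscall(SYS_move_pages, 0, (unsigned long)num_pages,
		       pages, nodes, status, MPOL_MF_MOVE);
}
```

A caller would read the TSC before and after move_to_node() and divide the
difference by num_pages to get a per-page cost. status[] should be checked as
well, since it reports the resulting node or a negative errno for each page.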

Measurement: Mean ± SD is reported in CPU cycles per page (normalized
by the number of pages, N)
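
The normalization above can be sketched as follows. This is illustrative
only and rests on assumptions not stated in this mail: each run yields one
total cycle count for N pages, and SD is the sample standard deviation
across runs. The names cycle_stats, approx_sqrt() and per_page_stats() are
hypothetical.

```c
#include <stddef.h>

/* Hypothetical result type; not part of the patch series. */
struct cycle_stats {
	double mean;	/* mean cycles per page across runs */
	double sd;	/* sample standard deviation across runs */
};

/*
 * Newton iteration for sqrt(x), used only to avoid linking libm.
 * 64 iterations are ample for the magnitudes that occur here.
 */
double approx_sqrt(double x)
{
	double g = x > 1.0 ? x : 1.0;

	if (x <= 0.0)
		return 0.0;
	for (int i = 0; i < 64; i++)
		g = 0.5 * (g + x / g);
	return g;
}

/*
 * total_cycles[r] is the cycle count of run r for num_pages pages.
 * Each run is normalized to cycles per page first; mean and SD are then
 * taken across runs. Assumes nr_runs >= 1 and num_pages >= 1.
 */
struct cycle_stats per_page_stats(const unsigned long long *total_cycles,
				  size_t nr_runs, size_t num_pages)
{
	struct cycle_stats s = { 0.0, 0.0 };
	double sum = 0.0, sq = 0.0;

	for (size_t r = 0; r < nr_runs; r++)
		sum += (double)total_cycles[r] / (double)num_pages;
	s.mean = sum / (double)nr_runs;

	for (size_t r = 0; r < nr_runs; r++) {
		double d = (double)total_cycles[r] / (double)num_pages - s.mean;

		sq += d * d;
	}
	if (nr_runs > 1)
		s.sd = approx_sqrt(sq / (double)(nr_runs - 1));
	return s;
}
```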

move_pages: Cycles taken by the move_pages(2) syscall (cost per page).
uncached_access: Cycles taken to access memory just after clflush, for pages
on src node 1.
cached_access: Cycles taken to access memory when all pages were previously
touched, for pages on src node 1.
post_move_access: Cycles taken to access memory just after the move_pages
syscall, when the pages have been moved to dst node 0.

Generic Kernel:
Size   move_pages          uncached_access  cached_access  post_move_access
4KB    193154.40±50519.59  1269.40±163.11   383.00±31.92   420.40±77.04
2MB    4930.36±100.74      793.46±82.39     208.59±2.07    181.34±11.55
512MB  4498.93±146.95      656.43±23.08     801.93±111.80  402.37±15.26
1GB    4419.88±203.91      627.85±13.24     776.01±94.27   384.24±7.33

Results with Patched Kernel:
1. Offload disabled - Folios batch-move using CPU
Size   move_pages          uncached_access  cached_access  post_move_access
4KB    206370.20±28303.18  1265.20±141.38   385.40±54.32   407.80±52.60
2MB    5110.16±188.60      794.05±72.25     208.65±1.75    177.48±9.93
512MB  4548.00±188.91      658.23±23.63     777.34±113.15  403.48±17.27
1GB    4521.19±195.13      628.85±14.72     750.85±98.22   387.79±9.49

2. Offload enabled - Folios batch-move using DMAengine
Size   move_pages          uncached_access  cached_access  post_move_access
4KB    222818.00±22710.80  1277.80±145.74   405.20±101.85  427.60±130.13
2MB    15590.80±288.89     799.36±76.60     208.79±2.11    183.21±11.67
512MB  14154.06±197.59     649.93±20.35     814.10±109.81  403.43±13.79
1GB    14415.04±303.83     629.03±14.83     731.16±97.67   385.08±7.62

Code snippet to access memory:
before = rdtsc();
for (int i = 0; i < num_pages; i++) {
	for (int j = 0; j < page_size; j += 64) {
		junk += *(long *)(pages[i] + j);
	}
}
after = rdtsc();
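
For anyone wanting to reproduce the uncached/cached access measurements, a
self-contained variant of the loop above might look like this. It is a sketch
under assumptions, not the exact harness used here: the helper names
timestamp(), flush_pages() and time_access() are hypothetical, 64-byte cache
lines are assumed, and on non-x86-64 builds it falls back to nanoseconds
from clock_gettime() with no cache flush.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__)
#include <x86intrin.h>
#define HAVE_X86_TSC 1
#endif

/* TSC cycles on x86-64; nanoseconds elsewhere (fallback, not cycles). */
uint64_t timestamp(void)
{
#ifdef HAVE_X86_TSC
	return __rdtsc();
#else
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

/* Flush every 64-byte line of every page (x86-64 only; no-op elsewhere). */
void flush_pages(void **pages, size_t num_pages, size_t page_size)
{
#ifdef HAVE_X86_TSC
	for (size_t i = 0; i < num_pages; i++)
		for (size_t j = 0; j < page_size; j += 64)
			_mm_clflush((char *)pages[i] + j);
	_mm_mfence();	/* order the flushes before the timed loads */
#else
	(void)pages;
	(void)num_pages;
	(void)page_size;
#endif
}

/*
 * Read one long per 64-byte line of every page and return the elapsed
 * ticks. The running sum is handed back through sum_out and the loads go
 * through a volatile pointer so the compiler cannot drop them.
 */
uint64_t time_access(void **pages, size_t num_pages, size_t page_size,
		     long *sum_out)
{
	long junk = 0;
	uint64_t before, after;

	before = timestamp();
	for (size_t i = 0; i < num_pages; i++)
		for (size_t j = 0; j < page_size; j += 64)
			junk += *(volatile long *)((char *)pages[i] + j);
	after = timestamp();

	*sum_out = junk;
	return after - before;
}
```

With these helpers, uncached_access would correspond to flush_pages()
followed by time_access(), and cached_access to a further time_access() on
pages that have already been touched. Divide the result by num_pages to get
cycles per page.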

Discussion:
1. Post-move access times show no significant difference between CPU and DMA
migration: at every workload size the means are within one SD of each other.
2. For smaller workloads, cached accesses are significantly faster than
uncached accesses. For larger workloads (512MB and 1GB) this benefit
disappears: cached_access is no faster than uncached_access.
3. As expected, post-migration access times are significantly lower than
pre-migration uncached access times, due to NUMA locality.
4. To make sure hardware prefetchers were not skewing the results, I ran
another test with them turned off. The post-migration access cycles for DMA
and CPU with prefetchers disabled are still similar.
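
Point 1 can be checked against the reported numbers with a few lines of
code. This is a crude overlap check, not a statistical test (the number of
runs is not given in this mail); struct sample and within_one_sd() are
hypothetical names, and the values are the post_move_access results from the
patched kernel above.

```c
#include <stddef.h>

struct sample {
	double mean;
	double sd;
};

/* True if the two means differ by no more than the larger of the two SDs. */
int within_one_sd(struct sample a, struct sample b)
{
	double diff = a.mean > b.mean ? a.mean - b.mean : b.mean - a.mean;
	double sd = a.sd > b.sd ? a.sd : b.sd;

	return diff <= sd;
}

/* post_move_access for 4KB, 2MB, 512MB, 1GB: offload disabled (CPU). */
const struct sample cpu_post[4] = {
	{ 407.80, 52.60 }, { 177.48, 9.93 }, { 403.48, 17.27 }, { 387.79, 9.49 },
};

/* post_move_access for 4KB, 2MB, 512MB, 1GB: offload enabled (DMA). */
const struct sample dma_post[4] = {
	{ 427.60, 130.13 }, { 183.21, 11.67 }, { 403.43, 13.79 }, { 385.08, 7.62 },
};
```

Calling within_one_sd(cpu_post[i], dma_post[i]) for each of the four sizes
returns true.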

Thanks,
Shivank

Thread overview: 9+ messages
2024-06-14 22:15 Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 1/5] mm: separate move/undo doing on folio list from migrate_pages_batch() Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 2/5] mm: add folios_copy() for copying pages in batch during migration Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 3/5] mm: add migrate_folios_batch_move to batch the folio move operations Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 4/5] mm: add support for DMA folio Migration Shivank Garg
2024-06-14 22:15 ` [RFC PATCH 5/5] dcbm: add dma core batch migrator for batch page offloading Shivank Garg
2024-06-15  4:02 ` [RFC PATCH 0/5] Enhancements to Page Migration with Batch Offloading via DMA Matthew Wilcox
2024-06-17 11:40   ` Garg, Shivank
2024-06-25  8:57     ` Garg, Shivank [this message]
