From: "Garg, Shivank" <shivankg@amd.com>
To: akpm@linux-foundation.org, david@kernel.org
Cc: lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com,
	vbabka@kernel.org, willy@infradead.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, ziy@nvidia.com,
	matthew.brost@intel.com, joshua.hahnjy@gmail.com,
	rakie.kim@sk.com, byungchul@sk.com, gourry@gourry.net,
	ying.huang@linux.alibaba.com, apopple@nvidia.com,
	dave@stgolabs.net, Jonathan.Cameron@huawei.com, rkodsara@amd.com,
	vkoul@kernel.org, bharata@amd.com, sj@kernel.org,
	weixugc@google.com, dan.j.williams@intel.com,
	rientjes@google.com, xuezhengchu@huawei.com, yiannis@zptcorp.com,
	dave.hansen@intel.com, hannes@cmpxchg.org, jhubbard@nvidia.com,
	peterx@redhat.com, riel@surriel.com, shakeel.butt@linux.dev,
	stalexan@redhat.com, tj@kernel.org, nifan.cxl@gmail.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH v4 0/6] Accelerate page migration with batch copying and hardware offload
Date: Wed, 18 Mar 2026 19:59:07 +0530
Message-ID: <a69f463c-0ee3-492c-8505-710d757a1f21@amd.com>
In-Reply-To: <20260309120725.308854-3-shivankg@amd.com>



On 3/9/2026 5:37 PM, Shivank Garg wrote:
> This is the fourth RFC of the patchset to enhance page migration by
> batching folio-copy operations and enabling acceleration via DMA offload.
> 
> Single-threaded, folio-by-folio copying bottlenecks page migration in
> modern systems with deep memory hierarchies, especially for large folios
> where copy overhead dominates, leaving significant hardware potential
> untapped.
> 
> By batching the copy phase, we create an opportunity for hardware
> acceleration. This series builds the framework and provides a DMA
> offload driver (dcbm) as a reference implementation, targeting bulk
> migration workloads where offloading the copy improves throughput
> and latency while freeing the CPU cycles.
> 

[snip]

> System Info: AMD Zen 3 EPYC server (2-sockets, 32 cores, SMT Enabled),
> 1 NUMA node per socket, v7.0-rc2, DVFS set to Performance, PTDMA hardware.

> 
> a. Baseline (vanilla kernel: v7.0-rc2, single-threaded, serial folio_copy):
> 
> ============================================================================================
>        | 4K          | 16K         | 64K         | 256K        | 1M          | 2M          |
> ============================================================================================
>        | 3.55±0.19   | 5.66±0.30   | 6.16±0.09   | 7.12±0.83   | 6.93±0.09   | 10.88±0.19  |
> 
> b. DMA offload (Patched Kernel, dcbm driver, N DMA channels):
> 
> =================================================================================================
> Channel Cnt | 4K          | 16K         | 64K         | 256K        | 1M          | 2M          |
> =================================================================================================
> 1           | 2.63±0.26   | 2.92±0.09   |  3.16±0.13  |  4.75±0.70  |  7.38±0.18  | 12.64±0.07  |
> 2           | 3.20±0.12   | 4.68±0.17   |  5.16±0.36  |  7.42±1.00  |  8.05±0.05  | 14.40±0.10  |
> 4           | 3.78±0.16   | 6.45±0.06   |  7.36±0.18  |  9.70±0.11  | 11.68±2.37  | 27.16±0.20  |
> 8           | 4.32±0.24   | 8.20±0.45   |  9.45±0.26  | 12.99±2.87  | 13.18±0.08  | 46.17±0.67  |
> 12          | 4.35±0.16   | 8.80±0.09   | 11.65±2.71  | 15.46±4.95  | 14.69±4.10  | 60.89±0.68  |
> 16          | 4.40±0.19   | 9.25±0.13   | 11.02±0.26  | 13.56±0.15  | 18.04±7.11  | 66.86±0.81  |

I ran experiments to evaluate DMA offload for page migration during
memory compaction (on the system above).

Each NUMA node has ~250GB of memory. I bind everything to Node 1 (CPU 32)
and keep background MM daemons disabled.

The experiment has two phases, fragmentation and compaction (migration),
with an optional CPU hog for the busy-system runs:

1. Memory Fragmentation

I allocate ~248GB of anonymous memory on Node 1 and touch every page to
ensure physical backing. Then, for each 2MB-aligned region (512
contiguous 4KB pages), I free 50% of pages at evenly-spaced offsets using
MADV_DONTNEED. The freed pages return to the buddy allocator, but the
remaining 256 occupied pages in each region prevent merging into higher
order blocks.

After this, Node 1 is 100% fragmented with 50% of its memory free, which
means every hugepage allocation requires compaction.

[ ] [X] [ ] [X] [ ] [X] [ ] [X] [ ] [X] [ ] [X] ...

The fragmenter process stays alive throughout the measurement, with
oom_score_adj=-1000 to prevent the OOM killer from targeting it.


2. Compaction Trigger

To benchmark compaction in a reproducible way, I use a kernel module that
calls alloc_pages_node() in a tight loop for the target node. Each
allocation enters the slow path:
__alloc_pages_slowpath() -> try_to_compact_pages() -> compact_zone() -> migrate_pages(),
performing page migration under MR_COMPACTION. The allocation is pinned
to CPU 32 on Node 1.

Target: Allocate 16384 order-9 pages (32GB), producing ~4.5 million
4KB page migrations per run.
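
As a sanity check on the target numbers, the arithmetic can be written out
as below. The helper names are hypothetical, and the migration figure is an
idealized estimate that assumes exactly 256 occupied pages must move out of
every 2MB region; it is not how the kernel accounts migrations.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE 4096ULL	/* 4KB base page, as in the text */
#define HUGE_ORDER     9	/* order-9 block = 512 base pages = 2MB */

/* Total bytes requested by `count` order-9 allocations. */
static uint64_t target_bytes(uint64_t count)
{
	return count * (BASE_PAGE_SIZE << HUGE_ORDER);
}

/*
 * Idealized number of 4KB migrations: each 2MB region holds
 * `occupied_per_block` pages that must move before the block is free.
 */
static uint64_t ideal_migrations(uint64_t count,
				 unsigned int occupied_per_block)
{
	assert(occupied_per_block <= (1u << HUGE_ORDER));
	return count * occupied_per_block;
}
```

With count = 16384 and 256 occupied pages per block this gives 32GB and
4,194,304 migrations. The measured pgmigrate_success deltas in the results
table (4.56 to 4.62 million) are roughly 9-10% above that idealized figure.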


3. CPU Contention (Busy System)

To emulate a busy system, I run a CPU-hogging process on the same CPU
as compaction:

while (run) { counter++; __asm__ volatile("" : "+r"(counter)); }

Both compaction and the hog are pinned to CPU 32, so they compete for the
same core, as they would when compaction shares CPU time with application
workloads.


I measure the following metrics:
1. Wall time: elapsed time for all hugepage allocations
2. Pages migrated: delta of /proc/vmstat counters (pgmigrate_success)
3. DMA copies: DCBM sysfs counter (folios_migrated)
4. CPU utilization: user%, sys%, idle% from /proc/stat for the pinned CPU during the run
5. Hog iterations (busy modes): total loop count of the CPU-hog process
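
For metric 4, one way to turn two /proc/stat samples into percentages is
sketched below. The struct and function names are hypothetical, and the
choice of denominator (the sum of the first eight fields) is an assumption;
the text does not say exactly how user%, sys% and idle% were derived.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

struct cpu_sample {
	unsigned long long user, nice, system, idle,
			   iowait, irq, softirq, steal;
};

/*
 * Parse one per-CPU line of /proc/stat, e.g. "cpu32 1 2 3 ...".
 * The aggregate "cpu  ..." line is rejected. Returns 0 on success.
 */
static int parse_cpu_line(const char *line, int *cpu, struct cpu_sample *s)
{
	if (strncmp(line, "cpu", 3) != 0 || !isdigit((unsigned char)line[3]))
		return -1;

	int n = sscanf(line, "cpu%d %llu %llu %llu %llu %llu %llu %llu %llu",
		       cpu, &s->user, &s->nice, &s->system, &s->idle,
		       &s->iowait, &s->irq, &s->softirq, &s->steal);
	return n == 9 ? 0 : -1;
}

static unsigned long long total_ticks(const struct cpu_sample *s)
{
	return s->user + s->nice + s->system + s->idle +
	       s->iowait + s->irq + s->softirq + s->steal;
}

/* Share of the interval between samples a and b spent in one field. */
static double pct(unsigned long long before, unsigned long long after,
		  const struct cpu_sample *a, const struct cpu_sample *b)
{
	unsigned long long total = total_ticks(b) - total_ticks(a);

	assert(total > 0);
	return 100.0 * (double)(after - before) / (double)total;
}
```

Sampling the cpu32 line before and after the run and calling
pct(a.system, b.system, &a, &b) would give a sys% figure comparable to the
one reported below, under the stated assumption about the denominator.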


Experiment Results:

I run four configurations, each after a fresh reboot to avoid buddy
allocator state degradation between runs:

baseline (vanilla kernel) and DMA (migration offload enabled),
each on an idle and a busy system.

# Mode           Wall time (ms)   Migrated   DMA_Copy     Hog_Iters    User%     Sys%    Idle%
----------------------------------------------------------------------------------------------
1 baseline                16708    4563506          -             -    0.00%   99.40%    0.29%
2 dma                     18887    4622952    4623181             -    0.00%   76.65%   22.55%
3 busy-baseline           33256    4599846          -   62300165085   49.90%   49.75%    0.06%
4 busy-dma                32475    4602750    4604672   66022189744   56.32%   42.97%    0.06%


Inference:

1. On an idle system, wall time increases by ~13% with DMA because the
current compaction batch size (COMPACT_CLUSTER_MAX = 32 pages) is
too small for DMA to amortize its setup cost. However, kernel sys%
drops from 99.4% to 76.7%, freeing ~22.5% of CPU time.

2. On a busy system, wall time decreases slightly (~2.3%) and the hog
process accumulates ~6% more iterations with DMA offload. The CPU
time freed during DMA transfers goes directly to the competing
userspace workload.
This indicates that DMA offload for compaction benefits busy systems
with high fragmentation.
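
The percentages quoted above follow directly from the results table. A
small helper (hypothetical name pct_change()) makes the arithmetic
explicit:

```c
#include <assert.h>

/* Relative change from `base` to `value`, in percent. */
static double pct_change(double base, double value)
{
	assert(base != 0.0);
	return 100.0 * (value - base) / base;
}
```

Using the table values: pct_change(16708, 18887) is about +13.0 (idle wall
time), pct_change(33256, 32475) is about -2.3 (busy wall time), and
pct_change(62300165085, 66022189744) is about +6.0 (hog iterations).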


Note:
Tuning the compaction algorithm for larger DMA batches and using DMA
hardware optimized for small-size transfers should improve the results
further.


Thanks,
Shivank


