From: Byungchul Park <byungchul@sk.com>
To: Bharata B Rao <bharata@amd.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Jonathan.Cameron@huawei.com, dave.hansen@intel.com,
	gourry@gourry.net, hannes@cmpxchg.org,
	mgorman@techsingularity.net, mingo@redhat.com,
	peterz@infradead.org, raghavendra.kt@amd.com, riel@surriel.com,
	rientjes@google.com, sj@kernel.org, weixugc@google.com,
	willy@infradead.org, ying.huang@linux.alibaba.com,
	ziy@nvidia.com, dave@stgolabs.net, nifan.cxl@gmail.com,
	xuezhengchu@huawei.com, yiannis@zptcorp.com,
	akpm@linux-foundation.org, david@redhat.com,
	kernel_team@skhynix.com
Subject: Re: [RFC PATCH v1 3/4] mm: kmigrated - Async kernel migration thread
Date: Mon, 7 Jul 2025 18:36:31 +0900	[thread overview]
Message-ID: <20250707093631.GA18924@system.software.com> (raw)
In-Reply-To: <20250616133931.206626-4-bharata@amd.com>

On Mon, Jun 16, 2025 at 07:09:30PM +0530, Bharata B Rao wrote:
> 
> kmigrated is a per-node kernel thread that migrates the
> folios marked for migration in batches. Each kmigrated
> thread walks the PFN range spanning its node and checks
> for potential migration candidates.
> 
> It depends on the fields added to extended page flags
> to determine the pages that need to be migrated and
> the target NID.
> 
> Signed-off-by: Bharata B Rao <bharata@amd.com>
> ---
>  include/linux/mmzone.h   |   5 +
>  include/linux/page_ext.h |  17 +++
>  mm/Makefile              |   3 +-
>  mm/kmigrated.c           | 223 +++++++++++++++++++++++++++++++++++++++
>  mm/mm_init.c             |   6 ++
>  mm/page_ext.c            |  11 ++
>  6 files changed, 264 insertions(+), 1 deletion(-)
>  create mode 100644 mm/kmigrated.c
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 283913d42d7b..5d7f0b8d3c91 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -853,6 +853,8 @@ enum zone_type {
> 
>  };
> 
> +int kmigrated_add_pfn(unsigned long pfn, int nid);
> +
>  #ifndef __GENERATING_BOUNDS_H
> 
>  #define ASYNC_AND_SYNC 2
> @@ -1049,6 +1051,7 @@ enum pgdat_flags {
>                                          * many pages under writeback
>                                          */
>         PGDAT_RECLAIM_LOCKED,           /* prevents concurrent reclaim */
> +       PGDAT_KMIGRATED_ACTIVATE,       /* activates kmigrated */
>  };
> 
>  enum zone_flags {
> @@ -1493,6 +1496,8 @@ typedef struct pglist_data {
>  #ifdef CONFIG_MEMORY_FAILURE
>         struct memory_failure_stats mf_stats;
>  #endif
> +       struct task_struct *kmigrated;
> +       wait_queue_head_t kmigrated_wait;
>  } pg_data_t;
> 
>  #define node_present_pages(nid)        (NODE_DATA(nid)->node_present_pages)
> diff --git a/include/linux/page_ext.h b/include/linux/page_ext.h
> index 76c817162d2f..4300c9dbafec 100644
> --- a/include/linux/page_ext.h
> +++ b/include/linux/page_ext.h
> @@ -40,8 +40,25 @@ enum page_ext_flags {
>         PAGE_EXT_YOUNG,
>         PAGE_EXT_IDLE,
>  #endif
> +       /*
> +        * 32 bits following this are used by the migrator.
> +        * The next available bit position is 33.
> +        */
> +       PAGE_EXT_MIGRATE_READY,
>  };
> 
> +#define PAGE_EXT_MIG_NID_WIDTH 10
> +#define PAGE_EXT_MIG_FREQ_WIDTH        3
> +#define PAGE_EXT_MIG_TIME_WIDTH        18
> +
> +#define PAGE_EXT_MIG_NID_SHIFT (PAGE_EXT_MIGRATE_READY + 1)
> +#define PAGE_EXT_MIG_FREQ_SHIFT        (PAGE_EXT_MIG_NID_SHIFT + PAGE_EXT_MIG_NID_WIDTH)
> +#define PAGE_EXT_MIG_TIME_SHIFT        (PAGE_EXT_MIG_FREQ_SHIFT + PAGE_EXT_MIG_FREQ_WIDTH)
> +
> +#define PAGE_EXT_MIG_NID_MASK  ((1UL << PAGE_EXT_MIG_NID_WIDTH) - 1)
> +#define PAGE_EXT_MIG_FREQ_MASK ((1UL << PAGE_EXT_MIG_FREQ_WIDTH) - 1)
> +#define PAGE_EXT_MIG_TIME_MASK ((1UL << PAGE_EXT_MIG_TIME_WIDTH) - 1)
> +
>  /*
>   * Page Extension can be considered as an extended mem_map.
>   * A page_ext page is associated with every page descriptor. The
> diff --git a/mm/Makefile b/mm/Makefile
> index 1a7a11d4933d..5a382f19105f 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -37,7 +37,8 @@ mmu-y                 := nommu.o
>  mmu-$(CONFIG_MMU)      := highmem.o memory.o mincore.o \
>                            mlock.o mmap.o mmu_gather.o mprotect.o mremap.o \
>                            msync.o page_vma_mapped.o pagewalk.o \
> -                          pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o
> +                          pgtable-generic.o rmap.o vmalloc.o vma.o vma_exec.o \
> +                          kmigrated.o
> 
> 
>  ifdef CONFIG_CROSS_MEMORY_ATTACH
> diff --git a/mm/kmigrated.c b/mm/kmigrated.c
> new file mode 100644
> index 000000000000..3caefe4be0e7
> --- /dev/null
> +++ b/mm/kmigrated.c
> @@ -0,0 +1,223 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * kmigrated is a kernel thread that runs for each node that has
> + * memory. It iterates over the node's PFNs and migrates pages
> + * marked for migration to their target nodes.
> + *
> + * kmigrated depends on PAGE_EXTENSION to find out the pages that
> + * need to be migrated. In addition to a few fields that could be
> + * used by hot page promotion logic to store and evaluate the page
> + * hotness information, the extended page flags field is extended
> + * to store the target NID for migration.
> + */
> +#include <linux/mm.h>
> +#include <linux/migrate.h>
> +#include <linux/cpuhotplug.h>
> +#include <linux/page_ext.h>
> +
> +#define KMIGRATE_DELAY MSEC_PER_SEC
> +#define KMIGRATE_BATCH 512
> +
> +static int page_ext_xchg_nid(struct page_ext *page_ext, int nid)
> +{
> +       unsigned long old_flags, flags;
> +       int old_nid;
> +
> +       old_flags = READ_ONCE(page_ext->flags);
> +       do {
> +               flags = old_flags;
> +               old_nid = (flags >> PAGE_EXT_MIG_NID_SHIFT) & PAGE_EXT_MIG_NID_MASK;
> +
> +               flags &= ~(PAGE_EXT_MIG_NID_MASK << PAGE_EXT_MIG_NID_SHIFT);
> +               flags |= (nid & PAGE_EXT_MIG_NID_MASK) << PAGE_EXT_MIG_NID_SHIFT;
> +       } while (unlikely(!try_cmpxchg(&page_ext->flags, &old_flags, flags)));
> +
> +       return old_nid;
> +}
> +
> +/*
> + * Marks the page as ready for migration.
> + *
> + * @pfn: PFN of the page
> + * @nid: Target NID to which the page needs to be migrated
> + *
> + * The request for migration is noted by setting PAGE_EXT_MIGRATE_READY
> + * in the extended page flags which the kmigrated thread would check.
> + */
> +int kmigrated_add_pfn(unsigned long pfn, int nid)
> +{
> +       struct page *page;
> +       struct page_ext *page_ext;
> +
> +       page = pfn_to_page(pfn);
> +       if (!page)
> +               return -EINVAL;
> +
> +       page_ext = page_ext_get(page);
> +       if (unlikely(!page_ext))
> +               return -EINVAL;
> +
> +       page_ext_xchg_nid(page_ext, nid);
> +       test_and_set_bit(PAGE_EXT_MIGRATE_READY, &page_ext->flags);
> +       page_ext_put(page_ext);
> +
> +       set_bit(PGDAT_KMIGRATED_ACTIVATE, &page_pgdat(page)->flags);
> +       return 0;
> +}
> +
> +/*
> + * If the page has been marked ready for migration, return
> + * the NID to which it needs to be migrated.
> + *
> + * If not, return NUMA_NO_NODE.
> + */
> +static int kmigrated_get_nid(struct page *page)
> +{
> +       struct page_ext *page_ext;
> +       int nid = NUMA_NO_NODE;
> +
> +       page_ext = page_ext_get(page);
> +       if (unlikely(!page_ext))
> +               return nid;
> +
> +       if (!test_and_clear_bit(PAGE_EXT_MIGRATE_READY, &page_ext->flags))
> +               goto out;
> +
> +       nid = page_ext_xchg_nid(page_ext, nid);
> +out:
> +       page_ext_put(page_ext);
> +       return nid;
> +}
> +
> +/*
> + * Walks the PFNs of the zone, isolates and migrates them in batches.
> + */
> +static void kmigrated_walk_zone(unsigned long start_pfn, unsigned long end_pfn,
> +                               int src_nid)
> +{
> +       int nid, cur_nid = NUMA_NO_NODE;
> +       LIST_HEAD(migrate_list);
> +       int batch_count = 0;
> +       struct folio *folio;
> +       struct page *page;
> +       unsigned long pfn;
> +
> +       for (pfn = start_pfn; pfn < end_pfn; pfn++) {

Hi,

Is it feasible to scan all the pages in each zone?  I think we should
figure out a better way to reduce the CPU time spent on this.
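
For example - a rough, untested sketch, with made-up names
(kmigrated_marked_pfns, kmigrated_note_pfn() and kmigrated_drain_pfns()
exist nowhere) - the marking side could record each PFN in a per-node
xarray, so that the consumer walks only the recorded entries rather
than every PFN in the zone:

	#include <linux/xarray.h>

	/* one xarray per node in practice; a single global one for brevity */
	static DEFINE_XARRAY(kmigrated_marked_pfns);

	/* producer side, e.g. from kmigrated_add_pfn(): O(1) per page */
	static int kmigrated_note_pfn(unsigned long pfn, int nid)
	{
		return xa_err(xa_store(&kmigrated_marked_pfns, pfn,
				       xa_mk_value(nid), GFP_ATOMIC));
	}

	/* consumer side: visit only the marked PFNs, not the whole zone */
	static void kmigrated_drain_pfns(void)
	{
		unsigned long pfn;
		void *entry;

		xa_for_each(&kmigrated_marked_pfns, pfn, entry) {
			int nid = xa_to_value(entry);

			xa_erase(&kmigrated_marked_pfns, pfn);
			/*
			 * Look up the folio at pfn, then isolate and
			 * batch-migrate it to nid as kmigrated does now.
			 */
		}
	}

This trades a small xa_store() cost at marking time for not having to
touch every PFN in the node once per KMIGRATE_DELAY.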

Besides the opinion above, I was already thinking of designing and
implementing a kthread for memory placement between different tiers -
I even named it, e.g. kmplaced - rather than relying on kswapd and
hinting faults, lol ;)

Now that you've started, I'd like to think about it together and improve
it so that it works better.  Please cc me from the next spin.

	Byungchul

> +               if (!pfn_valid(pfn))
> +                       continue;
> +
> +               page = pfn_to_online_page(pfn);
> +               if (!page)
> +                       continue;
> +
> +               if (page_to_nid(page) != src_nid)
> +                       continue;
> +
> +               /*
> +                * TODO: Take care of folio_nr_pages() increment
> +                * to pfn count.
> +                */
> +               folio = page_folio(page);
> +               if (!folio_test_lru(folio))
> +                       continue;
> +
> +               nid = kmigrated_get_nid(page);
> +               if (nid == NUMA_NO_NODE)
> +                       continue;
> +
> +               if (page_to_nid(page) == nid)
> +                       continue;
> +
> +               if (migrate_misplaced_folio_prepare(folio, NULL, nid))
> +                       continue;
> +
> +               if (cur_nid == NUMA_NO_NODE)
> +                       cur_nid = nid;
> +
> +               if (++batch_count >= KMIGRATE_BATCH || cur_nid != nid) {
> +                       migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +                       cur_nid = nid;
> +                       batch_count = 0;
> +                       cond_resched();
> +               }
> +               list_add(&folio->lru, &migrate_list);
> +       }
> +       if (!list_empty(&migrate_list))
> +               migrate_misplaced_folios_batch(&migrate_list, cur_nid);
> +}
> +
> +static void kmigrated_do_work(pg_data_t *pgdat)
> +{
> +       struct zone *zone;
> +       int zone_idx;
> +
> +       clear_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
> +       for (zone_idx = 0; zone_idx < MAX_NR_ZONES; zone_idx++) {
> +               zone = &pgdat->node_zones[zone_idx];
> +
> +               if (!populated_zone(zone))
> +                       continue;
> +
> +               if (zone_is_zone_device(zone))
> +                       continue;
> +
> +               kmigrated_walk_zone(zone->zone_start_pfn, zone_end_pfn(zone),
> +                                   pgdat->node_id);
> +       }
> +}
> +
> +static inline bool kmigrated_work_requested(pg_data_t *pgdat)
> +{
> +       return test_bit(PGDAT_KMIGRATED_ACTIVATE, &pgdat->flags);
> +}
> +
> +static void kmigrated_wait_work(pg_data_t *pgdat)
> +{
> +       long timeout = msecs_to_jiffies(KMIGRATE_DELAY);
> +
> +       wait_event_timeout(pgdat->kmigrated_wait,
> +                          kmigrated_work_requested(pgdat), timeout);
> +}
> +
> +/*
> + * Per-node kthread that iterates over its PFNs and migrates the
> + * pages that have been marked for migration.
> + */
> +static int kmigrated(void *p)
> +{
> +       pg_data_t *pgdat = (pg_data_t *)p;
> +
> +       while (!kthread_should_stop()) {
> +               kmigrated_wait_work(pgdat);
> +               kmigrated_do_work(pgdat);
> +       }
> +       return 0;
> +}
> +
> +static void kmigrated_run(int nid)
> +{
> +       pg_data_t *pgdat = NODE_DATA(nid);
> +
> +       if (pgdat->kmigrated)
> +               return;
> +
> +       pgdat->kmigrated = kthread_create(kmigrated, pgdat, "kmigrated%d", nid);
> +       if (IS_ERR(pgdat->kmigrated)) {
> +               pr_err("Failed to start kmigrated for node %d\n", nid);
> +               pgdat->kmigrated = NULL;
> +       } else {
> +               wake_up_process(pgdat->kmigrated);
> +       }
> +}
> +
> +static int __init kmigrated_init(void)
> +{
> +       int nid;
> +
> +       for_each_node_state(nid, N_MEMORY)
> +               kmigrated_run(nid);
> +
> +       return 0;
> +}
> +
> +subsys_initcall(kmigrated_init)
> diff --git a/mm/mm_init.c b/mm/mm_init.c
> index f2944748f526..3a9cfd175366 100644
> --- a/mm/mm_init.c
> +++ b/mm/mm_init.c
> @@ -1398,6 +1398,11 @@ static void pgdat_init_kcompactd(struct pglist_data *pgdat)
>  static void pgdat_init_kcompactd(struct pglist_data *pgdat) {}
>  #endif
> 
> +static void pgdat_init_kmigrated(struct pglist_data *pgdat)
> +{
> +       init_waitqueue_head(&pgdat->kmigrated_wait);
> +}
> +
>  static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
>  {
>         int i;
> @@ -1407,6 +1412,7 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
> 
>         pgdat_init_split_queue(pgdat);
>         pgdat_init_kcompactd(pgdat);
> +       pgdat_init_kmigrated(pgdat);
> 
>         init_waitqueue_head(&pgdat->kswapd_wait);
>         init_waitqueue_head(&pgdat->pfmemalloc_wait);
> diff --git a/mm/page_ext.c b/mm/page_ext.c
> index c351fdfe9e9a..546725fffddb 100644
> --- a/mm/page_ext.c
> +++ b/mm/page_ext.c
> @@ -76,6 +76,16 @@ static struct page_ext_operations page_idle_ops __initdata = {
>  };
>  #endif
> 
> +static bool need_page_mig(void)
> +{
> +       return true;
> +}
> +
> +static struct page_ext_operations page_mig_ops __initdata = {
> +       .need = need_page_mig,
> +       .need_shared_flags = true,
> +};
> +
>  static struct page_ext_operations *page_ext_ops[] __initdata = {
>  #ifdef CONFIG_PAGE_OWNER
>         &page_owner_ops,
> @@ -89,6 +99,7 @@ static struct page_ext_operations *page_ext_ops[] __initdata = {
>  #ifdef CONFIG_PAGE_TABLE_CHECK
>         &page_table_check_ops,
>  #endif
> +       &page_mig_ops,
>  };
> 
>  unsigned long page_ext_size;
> --
> 2.34.1
> 

