* [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow
@ 2024-12-14 6:23 yangge1116
2024-12-15 3:29 ` kernel test robot
2024-12-15 5:32 ` kernel test robot
From: yangge1116 @ 2024-12-14 6:23 UTC
To: akpm
Cc: linux-mm, linux-kernel, stable, 21cnbao, david, baolin.wang,
vbabka, liuzixing, yangge
From: yangge <yangge1116@126.com>

Since commit 984fdba6a32e ("mm, compaction: use proper alloc_flags
in __compaction_suitable()") allowed compaction to proceed when the
free pages required for compaction reside in CMA pageblocks, it is
possible for __compaction_suitable() to always return true, which is
not acceptable in some cases.

My machine has 4 NUMA nodes, each with 32GB of memory. I have
configured 16GB of CMA memory on each node. With this setup, starting
a 32GB virtual machine with device passthrough is extremely slow,
taking almost an hour.

During start-up, the virtual machine calls
pin_user_pages_remote(..., FOLL_LONGTERM, ...) to allocate its memory.
Long term GUP cannot allocate memory from the CMA area, so at most
16GB of non-CMA memory on a NUMA node can be used as virtual machine
memory. However, since the node still has 16GB of free CMA memory,
the order-0 watermark for compaction is always met, so
__compaction_suitable() always returns true, even though the node is
unable to allocate any more non-CMA memory for the virtual machine.

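The ALLOC_CMA dependence comes from the watermark helper: free pages
sitting in CMA pageblocks only count as usable when ALLOC_CMA is
passed. A simplified, hypothetical paraphrase of the CMA handling in
__zone_watermark_unusable_free() (not the exact source):

  static long usable_free_pages(struct zone *z, long free_pages,
                                unsigned int alloc_flags)
  {
          long unusable = 0;

  #ifdef CONFIG_CMA
          /* without ALLOC_CMA, free CMA pages do not count as usable */
          if (!(alloc_flags & ALLOC_CMA))
                  unusable += zone_page_state(z, NR_FREE_CMA_PAGES);
  #endif
          return free_pages - unusable;
  }

Because __compaction_suitable() passes ALLOC_CMA unconditionally, the
16GB of free CMA memory keeps its order-0 watermark check satisfied.
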
For costly allocations, because __compaction_suitable() always
returns true, __alloc_pages_slowpath() cannot exit at the appropriate
place, resulting in excessively long virtual machine start-up times.

Call trace:
__alloc_pages_slowpath
        if (compact_result == COMPACT_SKIPPED ||
            compact_result == COMPACT_DEFERRED)
                goto nopage; // should exit __alloc_pages_slowpath() from here

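(The quoted check sits in the costly-order part of the slow path; a
simplified sketch of the surrounding logic, assuming the caller did
not pass __GFP_RETRY_MAYFAIL:

  if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL)) {
          /*
           * Compaction was skipped or deferred, so more reclaim will
           * not help; fail fast so the caller can try another node.
           */
          if (compact_result == COMPACT_SKIPPED ||
              compact_result == COMPACT_DEFERRED)
                  goto nopage;
  }

With __compaction_suitable() always returning true, compaction is
attempted rather than skipped, so this early exit is never taken.)
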
In order to fall back quickly to a remote node, we should remove
ALLOC_CMA from both __compaction_suitable() and __isolate_free_page()
in the long term GUP flow. After this fix, starting a 32GB virtual
machine with device passthrough takes only a few seconds.

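For reference, the long term GUP flow is recognizable here because
__gup_longterm_locked() brackets the pin with memalloc_pin_save() /
memalloc_pin_restore(), which set and clear PF_MEMALLOC_PIN on the
task, and the page allocator only adds ALLOC_CMA to alloc_flags for
movable allocations, which PF_MEMALLOC_PIN forbids. A rough sketch of
the relevant mainline pieces (simplified, not part of this patch):

  /* mm/gup.c: long term pins run with PF_MEMALLOC_PIN set */
  unsigned int flags = memalloc_pin_save();
  /* ... pin_user_pages_remote(..., FOLL_LONGTERM, ...) ... */
  memalloc_pin_restore(flags);

  /* current_gfp_context(): PF_MEMALLOC_PIN drops __GFP_MOVABLE */
  if (pflags & PF_MEMALLOC_PIN)
          flags &= ~__GFP_MOVABLE;

  /* gfp_to_alloc_flags_cma(): only movable allocations get ALLOC_CMA */
  if (gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
          alloc_flags |= ALLOC_CMA;

So in this flow cc->alloc_flags never contains ALLOC_CMA, and
current->flags carries PF_MEMALLOC_PIN for the __isolate_free_page()
check.
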
Fixes: 984fdba6a32e ("mm, compaction: use proper alloc_flags in __compaction_suitable()")
Cc: <stable@vger.kernel.org>
Signed-off-by: yangge <yangge1116@126.com>
---
V2:
- use 'cc->alloc_flags' to determine whether 'ALLOC_CMA' is needed
- enrich the commit log description

 include/linux/compaction.h |  3 ++-
 mm/compaction.c            | 18 +++++++++++-------
 mm/page_alloc.c            |  4 +++-
 mm/vmscan.c                |  4 ++--
 4 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index e947764..0c6f97a 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -90,7 +90,8 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		struct page **page);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern bool compaction_suitable(struct zone *zone, int order,
-					int highest_zoneidx);
+					int highest_zoneidx,
+					unsigned int alloc_flags);
 
 extern void compaction_defer_reset(struct zone *zone, int order,
 		bool alloc_success);
diff --git a/mm/compaction.c b/mm/compaction.c
index 07bd227..585f5ab 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2381,9 +2381,11 @@ static enum compact_result compact_finished(struct compact_control *cc)
 
 static bool __compaction_suitable(struct zone *zone, int order,
 				  int highest_zoneidx,
+				  unsigned int alloc_flags,
 				  unsigned long wmark_target)
 {
 	unsigned long watermark;
+	bool use_cma;
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
 	 * isolate free pages for migration targets. This means that the
@@ -2395,25 +2397,27 @@ static bool __compaction_suitable(struct zone *zone, int order,
 	 * even if compaction succeeds.
 	 * For costly orders, we require low watermark instead of min for
 	 * compaction to proceed to increase its chances.
-	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
-	 * suitable migration targets
+	 * Except in the long term GUP flow, ALLOC_CMA is used, as pages
+	 * in CMA pageblocks are considered suitable migration targets
 	 */
 	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
 				low_wmark_pages(zone) : min_wmark_pages(zone);
 	watermark += compact_gap(order);
+	use_cma = !!(alloc_flags & ALLOC_CMA);
 	return __zone_watermark_ok(zone, 0, watermark, highest_zoneidx,
-				   ALLOC_CMA, wmark_target);
+				   use_cma ? ALLOC_CMA : 0, wmark_target);
 }
 
 /*
  * compaction_suitable: Is this suitable to run compaction on this zone now?
  */
-bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx)
+bool compaction_suitable(struct zone *zone, int order, int highest_zoneidx,
+			 unsigned int alloc_flags)
 {
 	enum compact_result compact_result;
 	bool suitable;
 
-	suitable = __compaction_suitable(zone, order, highest_zoneidx,
+	suitable = __compaction_suitable(zone, order, highest_zoneidx, alloc_flags,
 					 zone_page_state(zone, NR_FREE_PAGES));
 	/*
 	 * fragmentation index determines if allocation failures are due to
@@ -2474,7 +2478,7 @@ bool compaction_zonelist_suitable(struct alloc_context *ac, int order,
 		available = zone_reclaimable_pages(zone) / order;
 		available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
 		if (__compaction_suitable(zone, order, ac->highest_zoneidx,
-					  available))
+					  alloc_flags, available))
 			return true;
 	}
 
@@ -2499,7 +2503,7 @@ compaction_suit_allocation_order(struct zone *zone, unsigned int order,
 			      alloc_flags))
 		return COMPACT_SUCCESS;
 
-	if (!compaction_suitable(zone, order, highest_zoneidx))
+	if (!compaction_suitable(zone, order, highest_zoneidx, alloc_flags))
 		return COMPACT_SKIPPED;
 
 	return COMPACT_CONTINUE;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dde19db..9a5dfda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2813,6 +2813,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 {
 	struct zone *zone = page_zone(page);
 	int mt = get_pageblock_migratetype(page);
+	bool pin;
 
 	if (!is_migrate_isolate(mt)) {
 		unsigned long watermark;
@@ -2823,7 +2824,8 @@ int __isolate_free_page(struct page *page, unsigned int order)
 		 * exists.
 		 */
 		watermark = zone->_watermark[WMARK_MIN] + (1UL << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
+		pin = !!(current->flags & PF_MEMALLOC_PIN);
+		if (!zone_watermark_ok(zone, 0, watermark, 0, pin ? 0 : ALLOC_CMA))
 			return 0;
 	}
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5e03a61..806f031 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -5815,7 +5815,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 				    sc->reclaim_idx, 0))
 			return false;
 
-		if (compaction_suitable(zone, sc->order, sc->reclaim_idx))
+		if (compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
 			return false;
 	}
 
@@ -6043,7 +6043,7 @@ static inline bool compaction_ready(struct zone *zone, struct scan_control *sc)
 		return true;
 
 	/* Compaction cannot yet proceed. Do reclaim. */
-	if (!compaction_suitable(zone, sc->order, sc->reclaim_idx))
+	if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
 		return false;
 
 	/*
--
2.7.4
* Re: [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow
2024-12-14 6:23 [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow yangge1116
@ 2024-12-15 3:29 ` kernel test robot
2024-12-15 5:32 ` kernel test robot
From: kernel test robot @ 2024-12-15 3:29 UTC
To: yangge1116, akpm
Cc: oe-kbuild-all, linux-mm, linux-kernel, stable, 21cnbao, david,
baolin.wang, vbabka, liuzixing, yangge
Hi,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/yangge1116-126-com/mm-compaction-don-t-use-ALLOC_CMA-in-long-term-GUP-flow/20241214-142453
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/1734157420-31110-1-git-send-email-yangge1116%40126.com
patch subject: [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow
config: i386-buildonly-randconfig-004-20241215 (https://download.01.org/0day-ci/archive/20241215/202412151139.n93obAio-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241215/202412151139.n93obAio-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412151139.n93obAio-lkp@intel.com/
All errors (new ones prefixed by >>):
   mm/vmscan.c: In function 'should_continue_reclaim':
>> mm/vmscan.c:5822:21: error: too many arguments to function 'compaction_suitable'
    5822 |                 if (compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
         |                     ^~~~~~~~~~~~~~~~~~~
   In file included from mm/vmscan.c:36:
   include/linux/compaction.h:111:20: note: declared here
     111 | static inline bool compaction_suitable(struct zone *zone, int order,
         |                    ^~~~~~~~~~~~~~~~~~~
   mm/vmscan.c: In function 'compaction_ready':
   mm/vmscan.c:6050:14: error: too many arguments to function 'compaction_suitable'
    6050 |         if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
         |              ^~~~~~~~~~~~~~~~~~~
   include/linux/compaction.h:111:20: note: declared here
     111 | static inline bool compaction_suitable(struct zone *zone, int order,
         |                    ^~~~~~~~~~~~~~~~~~~
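
These errors indicate that only the CONFIG_COMPACTION declaration of
compaction_suitable() was updated; the !CONFIG_COMPACTION static inline
stub at include/linux/compaction.h:111 still takes three arguments. A
minimal sketch of the additional change a respin would need (assuming
the stub otherwise keeps its current shape):

  static inline bool compaction_suitable(struct zone *zone, int order,
                                         int highest_zoneidx,
                                         unsigned int alloc_flags)
  {
          return false;
  }
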
vim +/compaction_suitable +5822 mm/vmscan.c
  5778    
  5779    /*
  5780     * Reclaim/compaction is used for high-order allocation requests. It reclaims
  5781     * order-0 pages before compacting the zone. should_continue_reclaim() returns
  5782     * true if more pages should be reclaimed such that when the page allocator
  5783     * calls try_to_compact_pages() that it will have enough free pages to succeed.
  5784     * It will give up earlier than that if there is difficulty reclaiming pages.
  5785     */
  5786    static inline bool should_continue_reclaim(struct pglist_data *pgdat,
  5787                                               unsigned long nr_reclaimed,
  5788                                               struct scan_control *sc)
  5789    {
  5790            unsigned long pages_for_compaction;
  5791            unsigned long inactive_lru_pages;
  5792            int z;
  5793    
  5794            /* If not in reclaim/compaction mode, stop */
  5795            if (!in_reclaim_compaction(sc))
  5796                    return false;
  5797    
  5798            /*
  5799             * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX
  5800             * number of pages that were scanned. This will return to the caller
  5801             * with the risk reclaim/compaction and the resulting allocation attempt
  5802             * fails. In the past we have tried harder for __GFP_RETRY_MAYFAIL
  5803             * allocations through requiring that the full LRU list has been scanned
  5804             * first, by assuming that zero delta of sc->nr_scanned means full LRU
  5805             * scan, but that approximation was wrong, and there were corner cases
  5806             * where always a non-zero amount of pages were scanned.
  5807             */
  5808            if (!nr_reclaimed)
  5809                    return false;
  5810    
  5811            /* If compaction would go ahead or the allocation would succeed, stop */
  5812            for (z = 0; z <= sc->reclaim_idx; z++) {
  5813                    struct zone *zone = &pgdat->node_zones[z];
  5814                    if (!managed_zone(zone))
  5815                            continue;
  5816    
  5817                    /* Allocation can already succeed, nothing to do */
  5818                    if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
  5819                                          sc->reclaim_idx, 0))
  5820                            return false;
  5821    
> 5822                    if (compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
  5823                            return false;
  5824            }
  5825    
  5826            /*
  5827             * If we have not reclaimed enough pages for compaction and the
  5828             * inactive lists are large enough, continue reclaiming
  5829             */
  5830            pages_for_compaction = compact_gap(sc->order);
  5831            inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
  5832            if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
  5833                    inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
  5834    
  5835            return inactive_lru_pages > pages_for_compaction;
  5836    }
  5837    
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow
2024-12-14 6:23 [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow yangge1116
2024-12-15 3:29 ` kernel test robot
@ 2024-12-15 5:32 ` kernel test robot
From: kernel test robot @ 2024-12-15 5:32 UTC
To: yangge1116, akpm
Cc: llvm, oe-kbuild-all, linux-mm, linux-kernel, stable, 21cnbao,
david, baolin.wang, vbabka, liuzixing, yangge
Hi,
kernel test robot noticed the following build errors:
[auto build test ERROR on akpm-mm/mm-everything]
url: https://github.com/intel-lab-lkp/linux/commits/yangge1116-126-com/mm-compaction-don-t-use-ALLOC_CMA-in-long-term-GUP-flow/20241214-142453
base: https://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm.git mm-everything
patch link: https://lore.kernel.org/r/1734157420-31110-1-git-send-email-yangge1116%40126.com
patch subject: [PATCH V2] mm, compaction: don't use ALLOC_CMA in long term GUP flow
config: arm-randconfig-001-20241215 (https://download.01.org/0day-ci/archive/20241215/202412151325.svvh8EAB-lkp@intel.com/config)
compiler: clang version 20.0.0git (https://github.com/llvm/llvm-project 2dc22615fd46ab2566d0f26d5ba234ab12dc4bf8)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20241215/202412151325.svvh8EAB-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202412151325.svvh8EAB-lkp@intel.com/
All errors (new ones prefixed by >>):
   In file included from mm/vmscan.c:30:
   include/linux/mm_inline.h:47:41: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      47 |                 __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
         |                                            ~~~~~~~~~~~ ^ ~~~
   include/linux/mm_inline.h:49:22: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
      49 |                                 NR_ZONE_LRU_BASE + lru, nr_pages);
         |                                 ~~~~~~~~~~~~~~~~ ^ ~~~
   mm/vmscan.c:409:51: warning: arithmetic between different enumeration types ('enum zone_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
     409 |                 size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
         |                                               ~~~~~~~~~~~~~~~~ ^ ~~~
   mm/vmscan.c:1773:4: warning: arithmetic between different enumeration types ('enum vm_event_item' and 'enum zone_type') [-Wenum-enum-conversion]
    1773 |                         __count_zid_vm_events(PGSCAN_SKIP, zid, nr_skipped[zid]);
         |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:139:34: note: expanded from macro '__count_zid_vm_events'
     139 |         __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
         |                           ~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
   mm/vmscan.c:2279:51: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
    2279 |         inactive = lruvec_page_state(lruvec, NR_LRU_BASE + inactive_lru);
         |                                              ~~~~~~~~~~~ ^ ~~~~~~~~~~~~
   mm/vmscan.c:2280:49: warning: arithmetic between different enumeration types ('enum node_stat_item' and 'enum lru_list') [-Wenum-enum-conversion]
    2280 |         active = lruvec_page_state(lruvec, NR_LRU_BASE + active_lru);
         |                                            ~~~~~~~~~~~ ^ ~~~~~~~~~~
>> mm/vmscan.c:5822:61: error: too many arguments to function call, expected 3, have 4
    5822 |                 if (compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
         |                     ~~~~~~~~~~~~~~~~~~~                                   ^
   include/linux/compaction.h:111:20: note: 'compaction_suitable' declared here
     111 | static inline bool compaction_suitable(struct zone *zone, int order,
         |                    ^                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     112 |                                         int highest_zoneidx)
         |                                         ~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c:6050:61: error: too many arguments to function call, expected 3, have 4
    6050 |         if (!compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
         |              ~~~~~~~~~~~~~~~~~~~                                   ^
   include/linux/compaction.h:111:20: note: 'compaction_suitable' declared here
     111 | static inline bool compaction_suitable(struct zone *zone, int order,
         |                    ^                   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     112 |                                         int highest_zoneidx)
         |                                         ~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c:6239:3: warning: arithmetic between different enumeration types ('enum vm_event_item' and 'enum zone_type') [-Wenum-enum-conversion]
    6239 |                 __count_zid_vm_events(ALLOCSTALL, sc->reclaim_idx, 1);
         |                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/vmstat.h:139:34: note: expanded from macro '__count_zid_vm_events'
     139 |         __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta)
         |                           ~~~~~~~~~~~~~ ^ ~~~~~~~~~~~
   7 warnings and 2 errors generated.
vim +5822 mm/vmscan.c
  5778    
  5779    /*
  5780     * Reclaim/compaction is used for high-order allocation requests. It reclaims
  5781     * order-0 pages before compacting the zone. should_continue_reclaim() returns
  5782     * true if more pages should be reclaimed such that when the page allocator
  5783     * calls try_to_compact_pages() that it will have enough free pages to succeed.
  5784     * It will give up earlier than that if there is difficulty reclaiming pages.
  5785     */
  5786    static inline bool should_continue_reclaim(struct pglist_data *pgdat,
  5787                                               unsigned long nr_reclaimed,
  5788                                               struct scan_control *sc)
  5789    {
  5790            unsigned long pages_for_compaction;
  5791            unsigned long inactive_lru_pages;
  5792            int z;
  5793    
  5794            /* If not in reclaim/compaction mode, stop */
  5795            if (!in_reclaim_compaction(sc))
  5796                    return false;
  5797    
  5798            /*
  5799             * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX
  5800             * number of pages that were scanned. This will return to the caller
  5801             * with the risk reclaim/compaction and the resulting allocation attempt
  5802             * fails. In the past we have tried harder for __GFP_RETRY_MAYFAIL
  5803             * allocations through requiring that the full LRU list has been scanned
  5804             * first, by assuming that zero delta of sc->nr_scanned means full LRU
  5805             * scan, but that approximation was wrong, and there were corner cases
  5806             * where always a non-zero amount of pages were scanned.
  5807             */
  5808            if (!nr_reclaimed)
  5809                    return false;
  5810    
  5811            /* If compaction would go ahead or the allocation would succeed, stop */
  5812            for (z = 0; z <= sc->reclaim_idx; z++) {
  5813                    struct zone *zone = &pgdat->node_zones[z];
  5814                    if (!managed_zone(zone))
  5815                            continue;
  5816    
  5817                    /* Allocation can already succeed, nothing to do */
  5818                    if (zone_watermark_ok(zone, sc->order, min_wmark_pages(zone),
  5819                                          sc->reclaim_idx, 0))
  5820                            return false;
  5821    
> 5822                    if (compaction_suitable(zone, sc->order, sc->reclaim_idx, 0))
  5823                            return false;
  5824            }
  5825    
  5826            /*
  5827             * If we have not reclaimed enough pages for compaction and the
  5828             * inactive lists are large enough, continue reclaiming
  5829             */
  5830            pages_for_compaction = compact_gap(sc->order);
  5831            inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
  5832            if (can_reclaim_anon_pages(NULL, pgdat->node_id, sc))
  5833                    inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
  5834    
  5835            return inactive_lru_pages > pages_for_compaction;
  5836    }
  5837    
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki