linux-mm.kvack.org archive mirror
* [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
@ 2025-02-26  3:22 Gabriel Krisman Bertazi
  2025-02-26  6:54 ` Michal Hocko
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-02-26  3:22 UTC (permalink / raw)
  To: akpm
  Cc: linux-mm, Gabriel Krisman Bertazi, Michal Hocko, Mel Gorman,
	Vlastimil Babka, Baoquan He

Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
->lowmem_reserve[] for empty zone") removes the protection of lower
zones from allocations targeting memory-less high zones.  This had an
unintended impact on reclaim patterns, because it makes
high-zone-targeted allocations more likely to succeed in lower zones,
which adds pressure to said zones.  I.e., the following check in
zone_watermark_ok()/zone_watermark_fast() is less likely to trigger:

        if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
                return false;

As a result, we are observing an increase in reclaim and kswapd scans,
due to the increased pressure.  This was initially observed as increased
latency in filesystem operations when benchmarking with fio on a machine
with some memory-less zones, but it has since been associated with
increased contention in locks related to memory reclaim.  By reverting
this patch, the original performance was recovered on that machine.

The original commit was introduced as a clarification of the
/proc/zoneinfo output, so it doesn't seem there are use cases depending
on it, making the revert a simple solution.
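
To illustrate the effect, here is a simplified, hypothetical model of
how setup_per_zone_lowmem_reserve() fills the table, with and without
the reverted special case.  The zone sizes and ratios below are made up
for the example, and the "clear" case for ratio-less/empty lower zones
is ignored; this is a sketch, not kernel code:

```python
# Toy model of setup_per_zone_lowmem_reserve() (illustration only).

def lowmem_reserves(managed, ratio, skip_empty_upper):
    """managed: pages per zone, lowest zone first.
    Returns reserve[i][j]: pages zone i keeps back from allocations
    whose highest usable zone is j."""
    n = len(managed)
    reserve = [[0] * n for _ in range(n)]
    for i in range(n):
        acc = 0  # managed pages accumulated in zones above zone i
        for j in range(i + 1, n):
            acc += managed[j]
            if skip_empty_upper and managed[j] == 0:
                reserve[i][j] = 0  # what 96a5c186efff did for empty zones
            else:
                reserve[i][j] = acc // ratio[i]
    return reserve

# DMA, DMA32, Normal, Movable (empty), Device (empty) -- made-up sizes
managed = [3840, 402000, 23500000, 0, 0]
ratio = [256, 256, 32, 32, 32]

with_commit = lowmem_reserves(managed, ratio, skip_empty_upper=True)[0]
reverted = lowmem_reserves(managed, ratio, skip_empty_upper=False)[0]
print(with_commit)  # [0, 1570, 93367, 0, 0]
print(reverted)     # [0, 1570, 93367, 93367, 93367]
```

Note how the commit leaves DMA with zero protection against allocations
whose highest usable zone is the (empty) Movable or Device zone, while
the reverted behavior carries the Normal-zone value through.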

Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Baoquan He <bhe@redhat.com>
Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
---
 mm/page_alloc.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 579789600a3c..fe986e6de7a0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
 
 			for (j = i + 1; j < MAX_NR_ZONES; j++) {
 				struct zone *upper_zone = &pgdat->node_zones[j];
-				bool empty = !zone_managed_pages(upper_zone);
 
 				managed_pages += zone_managed_pages(upper_zone);
 
-				if (clear || empty)
+				if (clear)
 					zone->lowmem_reserve[j] = 0;
 				else
 					zone->lowmem_reserve[j] = managed_pages / ratio;
-- 
2.47.0



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26  3:22 [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" Gabriel Krisman Bertazi
@ 2025-02-26  6:54 ` Michal Hocko
  2025-02-26 10:00   ` Baoquan He
                     ` (2 more replies)
  2025-02-26 13:00 ` Vlastimil Babka
  2025-02-27 11:50 ` Mel Gorman
  2 siblings, 3 replies; 19+ messages in thread
From: Michal Hocko @ 2025-02-26  6:54 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: akpm, linux-mm, Mel Gorman, Vlastimil Babka, Baoquan He

On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> ->lowmem_reserve[] for empty zone") removes the protection of lower
> zones from allocations targeting memory-less high zones.  This had an
> unintended impact on the pattern of reclaims because it makes the
> high-zone-targeted allocation more likely to succeed in lower zones,
> which adds pressure to said zones.  I.e, the following corresponding
> checks in zone_watermark_ok/zone_watermark_fast are less likely to
> trigger:
> 
>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>                 return false;
> 
> As a result, we are observing an increase in reclaim and kswapd scans,
> due to the increased pressure.  This was initially observed as increased
> latency in filesystem operations when benchmarking with fio on a machine
> with some memory-less zones, but it has since been associated with
> increased contention in locks related to memory reclaim.  By reverting
> this patch, the original performance was recovered on that machine.

I think it would be nice to show the memory layout on that machine (is
there any movable or device zone)?

Exact reclaim patterns are really hard to predict, and it is a little
bit surprising that the said patch has caused increased kswapd activity,
because I would expect more reclaim with the lowmem reserves in place.
But it is quite possible that the higher-zone memory pressure is just
tipping over and increasing the lowmem pressure enough that it shows
up.

In any case, 96a5c186efff seems incorrect because it assumes that the
protection has anything to do with how the higher zone is populated,
while the protection fundamentally protects the lower zone from
higher-zone allocations.  Those allocations are independent of the
actual memory in that zone.

> The original commit was introduced as a clarification of the
> /proc/zoneinfo output, so it doesn't seem there are usecases depending
> on it, making the revert a simple solution.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Baoquan He <bhe@redhat.com>
> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!

> ---
>  mm/page_alloc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 579789600a3c..fe986e6de7a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>  
>  			for (j = i + 1; j < MAX_NR_ZONES; j++) {
>  				struct zone *upper_zone = &pgdat->node_zones[j];
> -				bool empty = !zone_managed_pages(upper_zone);
>  
>  				managed_pages += zone_managed_pages(upper_zone);
>  
> -				if (clear || empty)
> +				if (clear)
>  					zone->lowmem_reserve[j] = 0;
>  				else
>  					zone->lowmem_reserve[j] = managed_pages / ratio;
> -- 
> 2.47.0

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26  6:54 ` Michal Hocko
@ 2025-02-26 10:00   ` Baoquan He
  2025-02-26 10:52     ` Michal Hocko
  2025-02-26 13:07   ` Vlastimil Babka
  2025-02-26 16:05   ` Gabriel Krisman Bertazi
  2 siblings, 1 reply; 19+ messages in thread
From: Baoquan He @ 2025-02-26 10:00 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On 02/26/25 at 07:54am, Michal Hocko wrote:
> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
> > Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> > ->lowmem_reserve[] for empty zone") removes the protection of lower
> > zones from allocations targeting memory-less high zones.  This had an
> > unintended impact on the pattern of reclaims because it makes the
> > high-zone-targeted allocation more likely to succeed in lower zones,
> > which adds pressure to said zones.  I.e, the following corresponding
> > checks in zone_watermark_ok/zone_watermark_fast are less likely to
> > trigger:
> > 
> >         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
> >                 return false;
> > 
> > As a result, we are observing an increase in reclaim and kswapd scans,
> > due to the increased pressure.  This was initially observed as increased
> > latency in filesystem operations when benchmarking with fio on a machine
> > with some memory-less zones, but it has since been associated with
> > increased contention in locks related to memory reclaim.  By reverting
> > this patch, the original performance was recovered on that machine.
> 
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?

Yeah, printing /proc/zoneinfo and pasting it here would be helpful.

> 
> Exact reclaim patterns are really hard to predict and it is little bit
> surprising the said patch has caused an increased kswapd activity
> because I would expect that there will be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increase the lowmem pressure enough
> that it shows up.
> 
> In any case 96a5c186efff seems incorrect because it assumes that the
> protection has anything to do with how higher zone is populated while
> the protection fundamentaly protects lower zone from higher zones
> allocation. Those allocations are independent on the actual memory in
> that zone.

The protection value was introduced in non-NUMA times and later adapted
to NUMA systems.  However, it still only relates each zone to the other
zones within one specific node.  We may need to take this opportunity to
reconsider it; e.g. in the FALLBACK zonelists case it needs to take
crossing nodes into account.




* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 10:00   ` Baoquan He
@ 2025-02-26 10:52     ` Michal Hocko
  2025-02-26 11:00       ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-02-26 10:52 UTC (permalink / raw)
  To: Baoquan He
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On Wed 26-02-25 18:00:26, Baoquan He wrote:
> On 02/26/25 at 07:54am, Michal Hocko wrote:
[...]
> > In any case 96a5c186efff seems incorrect because it assumes that the
> > protection has anything to do with how higher zone is populated while
> > the protection fundamentaly protects lower zone from higher zones
> > allocation. Those allocations are independent on the actual memory in
> > that zone.
> 
> The protection value was introduced in non-NUMA time, and later adapted
> to NUMA system. While it still only reflects each zone with other zones
> within one specific node. We may need take this opportunity to
> reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> nodes into account.

Are you suggesting making the zone fallback list interleave nodes? I.e.
the numa_zonelist_order we used to have in the past, which was
removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 10:52     ` Michal Hocko
@ 2025-02-26 11:00       ` Michal Hocko
  2025-02-26 11:51         ` Baoquan He
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-02-26 11:00 UTC (permalink / raw)
  To: Baoquan He
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > On 02/26/25 at 07:54am, Michal Hocko wrote:
> [...]
> > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > protection has anything to do with how higher zone is populated while
> > > the protection fundamentaly protects lower zone from higher zones
> > > allocation. Those allocations are independent on the actual memory in
> > > that zone.
> > 
> > The protection value was introduced in non-NUMA time, and later adapted
> > to NUMA system. While it still only reflects each zone with other zones
> > within one specific node. We may need take this opportunity to
> > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > nodes into account.
> 
> Are you suggesting zone fallback list to interleave nodes? I.e.
> numa_zonelist_order we used to have in the past and that has been
> removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").

Btw. has 96a5c186efff tried to fix any actual runtime problem? The
changelog doesn't say much about that. 

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 11:00       ` Michal Hocko
@ 2025-02-26 11:51         ` Baoquan He
  2025-02-26 12:01           ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Baoquan He @ 2025-02-26 11:51 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On 02/26/25 at 12:00pm, Michal Hocko wrote:
> On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > [...]
> > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > protection has anything to do with how higher zone is populated while
> > > > the protection fundamentaly protects lower zone from higher zones
> > > > allocation. Those allocations are independent on the actual memory in
> > > > that zone.
> > > 
> > > The protection value was introduced in non-NUMA time, and later adapted
> > > to NUMA system. While it still only reflects each zone with other zones
> > > within one specific node. We may need take this opportunity to
> > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > nodes into account.
> > 
> > Are you suggesting zone fallback list to interleave nodes? I.e.
> > numa_zonelist_order we used to have in the past and that has been
> > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").

Hmm, if Gabriel can provide detailed node/zone information about the
system, we can check whether there's anything we can do to adjust
zone->lowmem_reserve[] to reflect its real usage and semantics.  I
hadn't thought of making the whole zone fallback list interleave nodes,
which involves a lot of changes.

> 
> Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> changelog doesn't say much about that. 

No, no actual problem was observed.  I was just trying to clarify the
semantics, because I was confused by the obscure values printed for
zone->lowmem_reserve[] in /proc/zoneinfo.

I think we can merge this revert first, then investigate how to clarify
it further.




* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 11:51         ` Baoquan He
@ 2025-02-26 12:01           ` Michal Hocko
  2025-02-26 15:57             ` Baoquan He
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-02-26 12:01 UTC (permalink / raw)
  To: Baoquan He
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On Wed 26-02-25 19:51:12, Baoquan He wrote:
> On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > [...]
> > > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > > protection has anything to do with how higher zone is populated while
> > > > > the protection fundamentaly protects lower zone from higher zones
> > > > > allocation. Those allocations are independent on the actual memory in
> > > > > that zone.
> > > > 
> > > > The protection value was introduced in non-NUMA time, and later adapted
> > > > to NUMA system. While it still only reflects each zone with other zones
> > > > within one specific node. We may need take this opportunity to
> > > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > > nodes into account.
> > > 
> > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > numa_zonelist_order we used to have in the past and that has been
> > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> 
> Hmm, if Gabriel can provide detailed node/zone information of the
> system, we can check if there's anything we can do to adjust
> zone->lowmem_reserve[] to reflect its real usage and semantics.

Why do you think anything needs to be adjusted?

> I haven't thought of the whole zone fallback list to interleave nodes
> which invovles a lot of change.
> 
> > 
> > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > changelog doesn't say much about that. 
> 
> No, no actual problem was observed on tht.

OK

> I was just trying to make
> clear the semantics because I was confused by its obscure value printing
> of zone->lowmem_reserve[] in /proc/zoneinfo.
> 
> I think we can merge this reverting firstly, then to investigate how to
> better clarify it.

What do you think needs to be clarified? How exactly is the original
output confusing?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26  3:22 [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" Gabriel Krisman Bertazi
  2025-02-26  6:54 ` Michal Hocko
@ 2025-02-26 13:00 ` Vlastimil Babka
  2025-02-27 11:50 ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2025-02-26 13:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, akpm
  Cc: linux-mm, Michal Hocko, Mel Gorman, Baoquan He



On 2/26/25 4:22 AM, Gabriel Krisman Bertazi wrote:
> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> ->lowmem_reserve[] for empty zone") removes the protection of lower
> zones from allocations targeting memory-less high zones.  This had an
> unintended impact on the pattern of reclaims because it makes the
> high-zone-targeted allocation more likely to succeed in lower zones,
> which adds pressure to said zones.  I.e, the following corresponding
> checks in zone_watermark_ok/zone_watermark_fast are less likely to
> trigger:
> 
>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>                 return false;
> 
> As a result, we are observing an increase in reclaim and kswapd scans,
> due to the increased pressure.  This was initially observed as increased
> latency in filesystem operations when benchmarking with fio on a machine
> with some memory-less zones, but it has since been associated with
> increased contention in locks related to memory reclaim.  By reverting
> this patch, the original performance was recovered on that machine.
> 
> The original commit was introduced as a clarification of the
> /proc/zoneinfo output, so it doesn't seem there are usecases depending
> on it, making the revert a simple solution.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Baoquan He <bhe@redhat.com>
> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

Reviewed-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  mm/page_alloc.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 579789600a3c..fe986e6de7a0 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>  
>  			for (j = i + 1; j < MAX_NR_ZONES; j++) {
>  				struct zone *upper_zone = &pgdat->node_zones[j];
> -				bool empty = !zone_managed_pages(upper_zone);
>  
>  				managed_pages += zone_managed_pages(upper_zone);
>  
> -				if (clear || empty)
> +				if (clear)
>  					zone->lowmem_reserve[j] = 0;
>  				else
>  					zone->lowmem_reserve[j] = managed_pages / ratio;




* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26  6:54 ` Michal Hocko
  2025-02-26 10:00   ` Baoquan He
@ 2025-02-26 13:07   ` Vlastimil Babka
  2025-02-26 16:05   ` Gabriel Krisman Bertazi
  2 siblings, 0 replies; 19+ messages in thread
From: Vlastimil Babka @ 2025-02-26 13:07 UTC (permalink / raw)
  To: Michal Hocko, Gabriel Krisman Bertazi
  Cc: akpm, linux-mm, Mel Gorman, Baoquan He



On 2/26/25 7:54 AM, Michal Hocko wrote:
> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
>> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
>> ->lowmem_reserve[] for empty zone") removes the protection of lower
>> zones from allocations targeting memory-less high zones.  This had an
>> unintended impact on the pattern of reclaims because it makes the
>> high-zone-targeted allocation more likely to succeed in lower zones,
>> which adds pressure to said zones.  I.e, the following corresponding
>> checks in zone_watermark_ok/zone_watermark_fast are less likely to
>> trigger:
>>
>>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>>                 return false;
>>
>> As a result, we are observing an increase in reclaim and kswapd scans,
>> due to the increased pressure.  This was initially observed as increased
>> latency in filesystem operations when benchmarking with fio on a machine
>> with some memory-less zones, but it has since been associated with
>> increased contention in locks related to memory reclaim.  By reverting
>> this patch, the original performance was recovered on that machine.
> 
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?
> 
> Exact reclaim patterns are really hard to predict and it is little bit
> surprising the said patch has caused an increased kswapd activity
> because I would expect that there will be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increase the lowmem pressure enough
> that it shows up.

My theory is that the commit caused a difference between kernel and
userspace allocations, with bad consequences.  A kernel allocation will
have highest_zoneidx = NORMAL and thus observe the lowmem_reserve for
ZONE_DMA32 unchanged.  A userspace allocation will have highest_zoneidx
= MOVABLE and thus will see a zero lowmem_reserve and will allocate from
ZONE_DMA32 (or even ZONE_DMA) when previously it wouldn't.

Then a kernel allocation might happen to wake up kswapd, which will see
DMA/DMA32 below their watermarks (with NORMAL highest_zoneidx) and try
to reclaim them back above the watermarks.  Since the LRU lists are
per-node and not per-zone anymore, it will spend a lot of effort finding
pages from DMA/DMA32 to reclaim.
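
This scenario can be sketched with a toy model of the watermark check
quoted in the changelog.  All numbers here are assumed for illustration;
only the compared condition mirrors the kernel check:

```python
# Toy model: with 96a5c186efff applied, a DMA32 zone's lowmem_reserve[]
# slot for MOVABLE is zeroed, so a userspace (highest_zoneidx = MOVABLE)
# allocation passes a check that a kernel (highest_zoneidx = NORMAL)
# allocation fails against the same zone.

ZONE_NORMAL, ZONE_MOVABLE = 2, 3

def zone_watermark_ok(free_pages, wmark, lowmem_reserve, highest_zoneidx):
    # mirrors: if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
    #                  return false;
    return free_pages > wmark + lowmem_reserve[highest_zoneidx]

dma32_reserve = [0, 0, 22134, 0]  # MOVABLE slot zeroed by the commit
free_pages, wmark = 20000, 1000   # assumed values

kernel_ok = zone_watermark_ok(free_pages, wmark, dma32_reserve, ZONE_NORMAL)
user_ok = zone_watermark_ok(free_pages, wmark, dma32_reserve, ZONE_MOVABLE)
print(kernel_ok, user_ok)  # False True
```

So the same DMA32 free-page count fails the kernel allocation's check
(triggering kswapd) while letting the userspace allocation dip into it.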

> In any case 96a5c186efff seems incorrect because it assumes that the
> protection has anything to do with how higher zone is populated while
> the protection fundamentaly protects lower zone from higher zones
> allocation. Those allocations are independent on the actual memory in
> that zone.
> 
>> The original commit was introduced as a clarification of the
>> /proc/zoneinfo output, so it doesn't seem there are usecases depending
>> on it, making the revert a simple solution.
>>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Baoquan He <bhe@redhat.com>
>> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
>> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>
> 
> Acked-by: Michal Hocko <mhocko@suse.com>
> Thanks!
> 
>> ---
>>  mm/page_alloc.c | 3 +--
>>  1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 579789600a3c..fe986e6de7a0 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -5849,11 +5849,10 @@ static void setup_per_zone_lowmem_reserve(void)
>>  
>>  			for (j = i + 1; j < MAX_NR_ZONES; j++) {
>>  				struct zone *upper_zone = &pgdat->node_zones[j];
>> -				bool empty = !zone_managed_pages(upper_zone);
>>  
>>  				managed_pages += zone_managed_pages(upper_zone);
>>  
>> -				if (clear || empty)
>> +				if (clear)
>>  					zone->lowmem_reserve[j] = 0;
>>  				else
>>  					zone->lowmem_reserve[j] = managed_pages / ratio;
>> -- 
>> 2.47.0
> 




* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 12:01           ` Michal Hocko
@ 2025-02-26 15:57             ` Baoquan He
  2025-02-26 17:46               ` Michal Hocko
  2025-02-27  9:16               ` Vlastimil Babka
  0 siblings, 2 replies; 19+ messages in thread
From: Baoquan He @ 2025-02-26 15:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On 02/26/25 at 01:01pm, Michal Hocko wrote:
> On Wed 26-02-25 19:51:12, Baoquan He wrote:
> > On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > > [...]
> > > > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > > > protection has anything to do with how higher zone is populated while
> > > > > > the protection fundamentaly protects lower zone from higher zones
> > > > > > allocation. Those allocations are independent on the actual memory in
> > > > > > that zone.
> > > > > 
> > > > > The protection value was introduced in non-NUMA time, and later adapted
> > > > > to NUMA system. While it still only reflects each zone with other zones
> > > > > within one specific node. We may need take this opportunity to
> > > > > reconsider it, e.g in the FALLBACK zonelists case it needs take crossing
> > > > > nodes into account.
> > > > 
> > > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > > numa_zonelist_order we used to have in the past and that has been
> > > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> > 
> > Hmm, if Gabriel can provide detailed node/zone information of the
> > system, we can check if there's anything we can do to adjust
> > zone->lowmem_reserve[] to reflect its real usage and semantics.
> 
> Why do you think anything needs to be adjusted?

No, I don't think so.  But I am wondering what led you to that
conclusion.

> 
> > I haven't thought of the whole zone fallback list to interleave nodes
> > which invovles a lot of change.
> > 
> > > 
> > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > > changelog doesn't say much about that. 
> > 
> > No, no actual problem was observed on tht.
> 
> OK
> 
> > I was just trying to make
> > clear the semantics because I was confused by its obscure value printing
> > of zone->lowmem_reserve[] in /proc/zoneinfo.
> > 
> > I think we can merge this reverting firstly, then to investigate how to
> > better clarify it.
> 
> What do you think needs to be clarify? How exactly is the original
> output confusing?

When I made the change, I wrote the reason in the commit log; from what
you are saying, I don't think you've read it.  Let me explain again: in
setup_per_zone_lowmem_reserve(), each zone's protection value is
calculated based only on its own node's zones.  E.g. below, on node 0,
the Movable and Device zones are empty but still show influence on the
Normal/DMA32/DMA zones, which is unreasonable given how the protection
values are calculated and displayed.

If, as your colleague Gabriel said, the protection value of the DMA zone
on node 0 really impacts allocations whose target zone is the Movable
zone, we may need to consider calculating the protection value across
nodes, because the impact happens among different nodes.  I only said we
can investigate; I didn't say we need to or have to change anything.

Node 0, zone      DMA
  ......
  pages free     2816
        ......
        protection: (0, 1582, 23716, 23716, 23716)
Node 0, zone    DMA32
  pages free     403269
        ......
        protection: (0, 0, 22134, 22134, 22134)
Node 0, zone   Normal
  pages free     5423879
        ......
        protection: (0, 0, 0, 0, 0)
Node 0, zone  Movable
  pages free     0
        ......
        protection: (0, 0, 0, 0, 0)
Node 0, zone   Device
  pages free     0
        ......
        protection: (0, 0, 0, 0, 0)

> 
> -- 
> Michal Hocko
> SUSE Labs
> 




* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26  6:54 ` Michal Hocko
  2025-02-26 10:00   ` Baoquan He
  2025-02-26 13:07   ` Vlastimil Babka
@ 2025-02-26 16:05   ` Gabriel Krisman Bertazi
  2025-02-26 23:00     ` Andrew Morton
  2 siblings, 1 reply; 19+ messages in thread
From: Gabriel Krisman Bertazi @ 2025-02-26 16:05 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, linux-mm, Mel Gorman, Vlastimil Babka, Baoquan He

[-- Attachment #1: Type: text/plain, Size: 2197 bytes --]

Michal Hocko <mhocko@suse.com> writes:

> On Tue 25-02-25 22:22:58, Gabriel Krisman Bertazi wrote:
>> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
>> ->lowmem_reserve[] for empty zone") removes the protection of lower
>> zones from allocations targeting memory-less high zones.  This had an
>> unintended impact on the pattern of reclaims because it makes the
>> high-zone-targeted allocation more likely to succeed in lower zones,
>> which adds pressure to said zones.  I.e, the following corresponding
>> checks in zone_watermark_ok/zone_watermark_fast are less likely to
>> trigger:
>> 
>>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>>                 return false;
>> 
>> As a result, we are observing an increase in reclaim and kswapd scans,
>> due to the increased pressure.  This was initially observed as increased
>> latency in filesystem operations when benchmarking with fio on a machine
>> with some memory-less zones, but it has since been associated with
>> increased contention in locks related to memory reclaim.  By reverting
>> this patch, the original performance was recovered on that machine.
>
> I think it would be nice to show the memory layout on that machine (is
> there any movable or device zone)?
>
> Exact reclaim patterns are really hard to predict and it is little bit
> surprising the said patch has caused an increased kswapd activity
> because I would expect that there will be more reclaim with the lowmem
> reserves in place. But it is quite possible that the higher zone memory
> pressure is just tipping over and increase the lowmem pressure enough
> that it shows up.

For reference, I collected vmstat with and without this patch on a
freshly booted system running intensive randread I/O from an NVMe
device for 5 minutes.  I got:

rpm-6.12.0-slfo.1.2 ->  pgscan_kswapd 5629543865
Patched             ->  pgscan_kswapd 33580844

33M scans is similar to what we had in kernels predating this patch.
These numbers are fairly representative of the workload on this machine,
as measured over several runs.  So we are talking about a
two-order-of-magnitude increase.
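
A quick back-of-the-envelope check of the ratio between those two
counters (trivial arithmetic on the numbers reported above):

```python
# Ratio of kswapd scan counts from the two runs reported above.
before = 5_629_543_865  # rpm-6.12.0-slfo.1.2
after = 33_580_844      # with the revert applied

ratio = before / after
print(round(ratio))  # ~168x, i.e. roughly two orders of magnitude
```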

Attached is the zoneinfo with my revert patch applied.


[-- Attachment #2: zoneinfo --]
[-- Type: application/octet-stream, Size: 30034 bytes --]

Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 2442
      nr_active_anon 92
      nr_inactive_file 1178
      nr_active_file 5819
      nr_unevictable 768
      nr_slab_reclaimable 3514
      nr_slab_unreclaimable 53499
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 79
      workingset_refault_anon 0
      workingset_refault_file 1226
      workingset_activate_anon 0
      workingset_activate_file 0
      workingset_restore_anon 0
      workingset_restore_file 0
      workingset_nodereclaim 0
      nr_anon_pages 2286
      nr_mapped    2457
      nr_file_pages 8014
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     1017
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   62518655
      nr_written   62518655
      nr_throttled_written 0
      nr_kernel_misc_reclaimable 0
      nr_foll_pin_acquired 130
      nr_foll_pin_released 130
      nr_kernel_stack 3428
      nr_page_table_pages 200
      nr_sec_page_table_pages 1806
      nr_iommu_pages 1806
      nr_swapcached 0
      pgpromote_success 0
      pgpromote_candidate 0
      pgdemote_kswapd 0
      pgdemote_direct 0
      pgdemote_khugepaged 0
      nr_hugetlb   0
  pages free     3840
        boost    0
        min      2
        low      5
        high     8
        promo    11
        spanned  4095
        present  3998
        managed  3840
        cma      0
        protection: (0, 1510, 63945, 63945, 63945)
      nr_free_pages 3840
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      nr_unaccepted 0
      numa_hit     0
      numa_miss    0
      numa_foreign 0
      numa_interleave 0
      numa_local   0
      numa_other   0
  pagesets
    cpu: 0
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 1
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 2
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 3
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 4
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 5
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 6
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 7
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 8
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 9
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 10
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 11
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 12
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 13
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 14
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 15
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 16
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 17
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 18
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 19
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 20
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 21
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 22
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 23
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 24
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 25
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 26
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 27
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 28
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 29
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 30
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
    cpu: 31
              count:    0
              high:     0
              batch:    1
              high_min: 4
              high_max: 30
  vm stats threshold: 12
  node_unreclaimable:  0
  start_pfn:           1
Node 0, zone    DMA32
  pages free     383047
        boost    0
        min      265
        low      651
        high     1037
        promo    1423
        spanned  1044480
        present  403592
        managed  386627
        cma      0
        protection: (0, 0, 62435, 62435, 62435)
      nr_free_pages 383047
      nr_zone_inactive_anon 0
      nr_zone_active_anon 0
      nr_zone_inactive_file 0
      nr_zone_active_file 0
      nr_zone_unevictable 0
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      nr_unaccepted 0
      numa_hit     935941
      numa_miss    6666
      numa_foreign 0
      numa_interleave 0
      numa_local   935941
      numa_other   6666
  pagesets
    cpu: 0
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 1
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 2
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 3
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 4
              count:    190
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 5
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 6
              count:    24
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 7
              count:    143
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 8
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 9
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 10
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 11
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 12
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 13
              count:    0
              high:     0
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 14
              count:    0
              high:     0
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 15
              count:    0
              high:     0
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 16
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 17
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 18
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 19
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 20
              count:    199
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 21
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 22
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 23
              count:    252
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 24
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 25
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 26
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 27
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 28
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 29
              count:    0
              high:     252
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 30
              count:    0
              high:     0
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
    cpu: 31
              count:    0
              high:     0
              batch:    63
              high_min: 252
              high_max: 3020
  vm stats threshold: 60
  node_unreclaimable:  0
  start_pfn:           4096
Node 0, zone   Normal
  pages free     15800625
        boost    0
        min      10958
        low      26941
        high     42924
        promo    58907
        spanned  16252928
        present  16252928
        managed  15983508
        cma      0
        protection: (0, 0, 0, 0, 0)
      nr_free_pages 15800625
      nr_zone_inactive_anon 2442
      nr_zone_active_anon 92
      nr_zone_inactive_file 1178
      nr_zone_active_file 5819
      nr_zone_unevictable 768
      nr_zone_write_pending 0
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      nr_unaccepted 0
      numa_hit     109927768
      numa_miss    698232
      numa_foreign 5735015
      numa_interleave 6476
      numa_local   109925700
      numa_other   700300
  pagesets
    cpu: 0
              count:    471
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 1
              count:    483
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 2
              count:    281
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 3
              count:    1616
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 4
              count:    1317
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 5
              count:    290
              high:     1746
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 6
              count:    662
              high:     1746
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 7
              count:    1627
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 8
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 9
              count:    1116
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 10
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 11
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 12
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 13
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 14
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 15
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 16
              count:    252
              high:     1998
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 17
              count:    445
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 18
              count:    216
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 19
              count:    238
              high:     1809
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 20
              count:    891
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 21
              count:    1556
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 22
              count:    410
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 23
              count:    223
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 24
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 25
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 26
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 27
              count:    39
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 28
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 29
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 30
              count:    31
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
    cpu: 31
              count:    0
              high:     1683
              batch:    63
              high_min: 1683
              high_max: 124871
  vm stats threshold: 120
  node_unreclaimable:  0
  start_pfn:           1048576
Node 0, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        promo    32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)
Node 0, zone   Device
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        promo    0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)
Node 1, zone      DMA
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        promo    0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)
Node 1, zone    DMA32
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        promo    0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)
Node 1, zone   Normal
  per-node stats
      nr_inactive_anon 4127
      nr_active_anon 121
      nr_inactive_file 3690
      nr_active_file 10718
      nr_unevictable 0
      nr_slab_reclaimable 6504
      nr_slab_unreclaimable 21331
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 146
      workingset_refault_anon 0
      workingset_refault_file 4050
      workingset_activate_anon 0
      workingset_activate_file 0
      workingset_restore_anon 0
      workingset_restore_file 0
      workingset_nodereclaim 0
      nr_anon_pages 4003
      nr_mapped    7490
      nr_file_pages 14661
      nr_dirty     1
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     253
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   3345081
      nr_written   3345076
      nr_throttled_written 0
      nr_kernel_misc_reclaimable 0
      nr_foll_pin_acquired 0
      nr_foll_pin_released 0
      nr_kernel_stack 4824
      nr_page_table_pages 237
      nr_sec_page_table_pages 1806
      nr_iommu_pages 1806
      nr_swapcached 0
      pgpromote_success 0
      pgpromote_candidate 0
      pgdemote_kswapd 0
      pgdemote_direct 0
      pgdemote_khugepaged 0
      nr_hugetlb   0
  pages free     16395139
        boost    0
        min      11301
        low      27783
        high     44265
        promo    60747
        spanned  16777216
        present  16777216
        managed  16490125
        cma      0
        protection: (0, 0, 0, 0, 0)
      nr_free_pages 16395139
      nr_zone_inactive_anon 4127
      nr_zone_active_anon 121
      nr_zone_inactive_file 3690
      nr_zone_active_file 10718
      nr_zone_unevictable 0
      nr_zone_write_pending 1
      nr_mlock     0
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      nr_unaccepted 0
      numa_hit     32499922
      numa_miss    5735015
      numa_foreign 704898
      numa_interleave 6936
      numa_local   32488458
      numa_other   5746479
  pagesets
    cpu: 0
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 1
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 2
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 3
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 4
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 5
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 6
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 7
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 8
              count:    308
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 9
              count:    238
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 10
              count:    127
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 11
              count:    435
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 12
              count:    202
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 13
              count:    288
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 14
              count:    215
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 15
              count:    300
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 16
              count:    11
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 17
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 18
              count:    2
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 19
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 20
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 21
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 22
              count:    31
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 23
              count:    0
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 24
              count:    376
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 25
              count:    930
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 26
              count:    1643
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 27
              count:    1700
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 28
              count:    1540
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 29
              count:    1709
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 30
              count:    1629
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
    cpu: 31
              count:    153
              high:     1736
              batch:    63
              high_min: 1736
              high_max: 128773
  vm stats threshold: 120
  node_unreclaimable:  0
  start_pfn:           17301504
Node 1, zone  Movable
  pages free     0
        boost    0
        min      32
        low      32
        high     32
        promo    32
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)
Node 1, zone   Device
  pages free     0
        boost    0
        min      0
        low      0
        high     0
        promo    0
        spanned  0
        present  0
        managed  0
        cma      0
        protection: (0, 0, 0, 0, 0)

[-- Attachment #3: Type: text/plain, Size: 29 bytes --]


-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 15:57             ` Baoquan He
@ 2025-02-26 17:46               ` Michal Hocko
  2025-02-27  9:41                 ` Baoquan He
  2025-02-27  9:16               ` Vlastimil Babka
  1 sibling, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2025-02-26 17:46 UTC (permalink / raw)
  To: Baoquan He
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On Wed 26-02-25 23:57:48, Baoquan He wrote:
> On 02/26/25 at 01:01pm, Michal Hocko wrote:
> > On Wed 26-02-25 19:51:12, Baoquan He wrote:
> > > On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > > > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > > > [...]
> > > > > > > In any case 96a5c186efff seems incorrect because it assumes that the
> > > > > > > protection has anything to do with how higher zone is populated while
> > > > > > > the protection fundamentaly protects lower zone from higher zones
> > > > > > > allocation. Those allocations are independent on the actual memory in
> > > > > > > that zone.
> > > > > > 
> > > > > > The protection value was introduced in non-NUMA times and later
> > > > > > adapted to NUMA systems. However, it still only reflects each zone's
> > > > > > relation to the other zones within one specific node. We may need to
> > > > > > take this opportunity to reconsider it; e.g., in the FALLBACK
> > > > > > zonelists case it needs to take crossing nodes into account.
> > > > > 
> > > > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > > > numa_zonelist_order we used to have in the past and that has been
> > > > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> > > 
> > > Hmm, if Gabriel can provide detailed node/zone information of the
> > > system, we can check if there's anything we can do to adjust
> > > zone->lowmem_reserve[] to reflect its real usage and semantics.
> > 
> > Why do you think anything needs to be adjusted?
> 
> No, I don't think that. I am just wondering what led you to that
> conclusion.
> 
> > 
> > > I haven't thought of the whole zone fallback list to interleave nodes
> > > which involves a lot of change.
> > > 
> > > > 
> > > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > > > changelog doesn't say much about that. 
> > > 
> > > No, no actual problem was observed on that.
> > 
> > OK
> > 
> > > I was just trying to clarify
> > > the semantics, because I was confused by the obscure values printed
> > > for zone->lowmem_reserve[] in /proc/zoneinfo.
> > > 
> > > I think we can merge this revert first, then investigate how to
> > > clarify it better.
> > 
> > What do you think needs to be clarified? How exactly is the original
> > output confusing?
> 
> When I did the change, I wrote the reason in the commit log; from what
> you say, I don't think you read it. Let me explain again: in
> setup_per_zone_lowmem_reserve(), each zone's protection value is
> calculated based only on its own node's zones. E.g. below on node 0, the
> Movable and Device zones are empty but still show influence on the
> Normal/DMA32/DMA zones, which is unreasonable given how the protection
> values are calculated and displayed.

You said that in the commit message without explaining why. Also, I
claim this is just wrong, because a zone's protection is independent of
the size of the zone it is protected from. I have explained why I
believe that, but let me reiterate: ZONE_DMA32 should be protected from
GFP_MOVABLE even if the Movable zone is empty (just as if it had a
single page or many). Why? Because the lowmem reserve protects low
memory allocation requests.

See my point? Is that reasoning clear?

P.S.
I think we can have a more productive discussion without accusations.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 16:05   ` Gabriel Krisman Bertazi
@ 2025-02-26 23:00     ` Andrew Morton
  0 siblings, 0 replies; 19+ messages in thread
From: Andrew Morton @ 2025-02-26 23:00 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Michal Hocko, linux-mm, Mel Gorman, Vlastimil Babka, Baoquan He

On Wed, 26 Feb 2025 11:05:10 -0500 Gabriel Krisman Bertazi <krisman@suse.de> wrote:

> For reference, I collected vmstat with and without this patch on a
> freshly booted system running intensive randread io from an nvme for 5
> minutes. I got:
> 
> rpm-6.12.0-slfo.1.2 ->  pgscan_kswapd 5629543865
> Patched             ->  pgscan_kswapd 33580844
> 
> 33M scans is similar to what we had in kernels predating this patch.
> These numbers are fairly representative of the workload on this machine,
> as measured in several runs.  So we are talking about a two-order-of-magnitude
> increase.

Thanks for adding this detail - I pasted it into the changelog.

I'll queue the patch with a cc:stable.  You may wish to send along a
new changelog to fill in additional details resulting from the review
discussion.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 15:57             ` Baoquan He
  2025-02-26 17:46               ` Michal Hocko
@ 2025-02-27  9:16               ` Vlastimil Babka
  2025-02-27 10:24                 ` Baoquan He
  1 sibling, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2025-02-27  9:16 UTC (permalink / raw)
  To: Baoquan He, Michal Hocko
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman

On 2/26/25 16:57, Baoquan He wrote:
> On 02/26/25 at 01:01pm, Michal Hocko wrote:
>> 
>> Why do you think anything needs to be adjusted?
> 
> No, I don't think that. I am just wondering what led you to that
> conclusion.
> 
>> 
>> > I haven't thought of the whole zone fallback list to interleave nodes
>> > which involves a lot of change.
>> > 
>> > > 
>> > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
>> > > changelog doesn't say much about that. 
>> > 
>> > No, no actual problem was observed on that.
>> 
>> OK
>> 
>> > I was just trying to clarify
>> > the semantics, because I was confused by the obscure values printed
>> > for zone->lowmem_reserve[] in /proc/zoneinfo.
>> > 
>> > I think we can merge this revert first, then investigate how to
>> > clarify it better.
>> 
>> What do you think needs to be clarified? How exactly is the original
>> output confusing?
> 
> When I did the change, I wrote the reason in the commit log; from what
> you say, I don't think you read it. Let me explain again: in
> setup_per_zone_lowmem_reserve(), each zone's protection value is
> calculated based only on its own node's zones. E.g. below on node 0, the
> Movable and Device zones are empty but still show influence on the
> Normal/DMA32/DMA zones, which is unreasonable given how the protection
> values are calculated and displayed.

It's not unreasonable. A GFP_HIGHUSER_MOVABLE allocation can use zones up
to the Movable zone, so e.g. the DMA32 zone should be protected from such
an allocation so that it has space left for GFP_DMA32-restricted
allocations.

If no Movable zone exists, but a Normal zone does, the result is that the
protection will be the same for GFP_KERNEL allocations (which can use zones
up to the Normal zone) and GFP_HIGHUSER_MOVABLE allocations (i.e. the
number 22134 in your listing is the same for both indexes). That's fine.
But setting the protection from Movable allocations to 0, as commit
96a5c186efff did, was simply a bug, as it can directly lead to
GFP_HIGHUSER_MOVABLE depleting ZONE_DMA32.

The only "unreasonable" part here is that we define and show protections
from ZONE_DEVICE allocations. The usage of this zone is AFAIK completely
separate from normal page allocation through zonelists, so we could exclude
it, if anyone cared enough.

> If, as your colleague Gabriel said, the protection value of the DMA zone
> on node 0 really impacts allocations when the targeted zone is the
> Movable zone, we may need to consider calculating the protection values
> across nodes, because the impact happens among different nodes. I only
> said we could investigate; I didn't say we need to change or have to
> change anything.

There might be a theoretical issue if e.g. Node 0 only contained DMA and
DMA32 zones and nothing else while the Normal zone is on Node 1: there
would be no protection for the DMA/DMA32 zones from Normal allocations,
as setup_per_zone_lowmem_reserve() considers each node separately and
thus would not take the Normal zone size from Node 1 into account.

Should we sum zone sizes across all nodes then? But then __GFP_THISNODE
Normal allocations for node 0 would never succeed? Or we'd need a separate
lowmem_reserve array for those?

I guess the issue doesn't happen in practice. In any case it's out of scope
of the reverted commit and the revert.

> Node 0, zone      DMA
>   ......
>   pages free     2816
>         ......
>         protection: (0, 1582, 23716, 23716, 23716)
> Node 0, zone    DMA32
>   pages free     403269
>         ......
>         protection: (0, 0, 22134, 22134, 22134)
> Node 0, zone   Normal
>   pages free     5423879
>         ......
>         protection: (0, 0, 0, 0, 0)
> Node 0, zone  Movable
>   pages free     0
>         ......
>         protection: (0, 0, 0, 0, 0)
> Node 0, zone   Device
>   pages free     0
>         ......
>         protection: (0, 0, 0, 0, 0)
> 
>> 
>> -- 
>> Michal Hocko
>> SUSE Labs
>> 
> 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26 17:46               ` Michal Hocko
@ 2025-02-27  9:41                 ` Baoquan He
  0 siblings, 0 replies; 19+ messages in thread
From: Baoquan He @ 2025-02-27  9:41 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman, Vlastimil Babka

On 02/26/25 at 06:46pm, Michal Hocko wrote:
> On Wed 26-02-25 23:57:48, Baoquan He wrote:
> > On 02/26/25 at 01:01pm, Michal Hocko wrote:
> > > On Wed 26-02-25 19:51:12, Baoquan He wrote:
> > > > On 02/26/25 at 12:00pm, Michal Hocko wrote:
> > > > > On Wed 26-02-25 11:52:41, Michal Hocko wrote:
> > > > > > On Wed 26-02-25 18:00:26, Baoquan He wrote:
> > > > > > > On 02/26/25 at 07:54am, Michal Hocko wrote:
> > > > > > [...]
> > > > > > > > In any case 96a5c186efff seems incorrect, because it assumes that
> > > > > > > > the protection has anything to do with how the higher zone is
> > > > > > > > populated, while the protection fundamentally protects lower zones
> > > > > > > > from higher-zone allocations. Those allocations are independent of
> > > > > > > > the actual memory in that zone.
> > > > > > > 
> > > > > > > The protection value was introduced in non-NUMA times and later
> > > > > > > adapted to NUMA systems, but it still only relates each zone to the
> > > > > > > other zones within one specific node. We may need to take this
> > > > > > > opportunity to reconsider it; e.g. in the FALLBACK zonelists case it
> > > > > > > needs to take crossing nodes into account.
> > > > > > 
> > > > > > Are you suggesting zone fallback list to interleave nodes? I.e.
> > > > > > numa_zonelist_order we used to have in the past and that has been
> > > > > > removed by c9bff3eebc09 ("mm, page_alloc: rip out ZONELIST_ORDER_ZONE").
> > > > 
> > > > Hmm, if Gabriel can provide detailed node/zone information of the
> > > > system, we can check if there's anything we can do to adjust
> > > > zone->lowmem_reserve[] to reflect its real usage and semantics.
> > > 
> > > Why do you think anything needs to be adjusted?
> > 
> > No, I don't think that. I am just wondering what led you to that
> > conclusion.
> > 
> > > 
> > > > I haven't thought of the whole zone fallback list to interleave nodes
> > > > which involves a lot of change.
> > > > 
> > > > > 
> > > > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> > > > > changelog doesn't say much about that. 
> > > > 
> > > > No, no actual problem was observed on that.
> > > 
> > > OK
> > > 
> > > > I was just trying to clarify
> > > > the semantics, because I was confused by the obscure values printed
> > > > for zone->lowmem_reserve[] in /proc/zoneinfo.
> > > > 
> > > > I think we can merge this revert first, then investigate how to
> > > > clarify it better.
> > > 
> > > What do you think needs to be clarified? How exactly is the original
> > > output confusing?
> > 
> > When I did the change, I wrote the reason in the commit log; from what
> > you say, I don't think you read it. Let me explain again: in
> > setup_per_zone_lowmem_reserve(), each zone's protection value is
> > calculated based only on its own node's zones. E.g. below on node 0, the
> > Movable and Device zones are empty but still show influence on the
> > Normal/DMA32/DMA zones, which is unreasonable given how the protection
> > values are calculated and displayed.
> 
> You said that in the commit message without explaining why. Also, I
> claim this is just wrong, because a zone's protection is independent of
> the size of the zone it is protected from. I have explained why I
> believe that, but let me reiterate: ZONE_DMA32 should be protected from
> GFP_MOVABLE even if the Movable zone is empty (just as if it had a
> single page or many). Why? Because the lowmem reserve protects low
> memory allocation requests.
> 
> See my point? Is that reasoning clear?

Very clear. But currently the protection is calculated node by node.
Please think about one case: Node 0 only has ZONE_DMA and ZONE_DMA32,
while nodes 1, 2, 3 ... N have ZONE_NORMAL and ZONE_MOVABLE. How could
ZONE_DMA32 be protected from GFP_MOVABLE then? Does the Linux kernel
restrict the node layout so that Node 0 can't look like this? Especially
on arm64, where there's sometimes only ZONE_DMA and its boundary is not
fixed, what if a system vendor arranges Node 0 to have only ZONE_DMA?

Secondly, the existing protection ratios were created based on old x86
systems. They may not fit current architectures; e.g. arm64 only has
ZONE_DMA, which is under 4G by default, so the default ratio is
obviously not suitable any more. And it is clear that the current
protection values only cover __GFP_THISNODE allocations.

======
/*                                      
 * results with 256, 32 in the lowmem_reserve sysctl:
 *      1G machine -> (16M dma, 800M-16M normal, 1G-800M high)
 *      1G machine -> (16M dma, 784M normal, 224M high)
 *      NORMAL allocation will leave 784M/256 of ram reserved in the ZONE_DMA
 *      HIGHMEM allocation will leave 224M/32 of ram reserved in ZONE_NORMAL
 *      HIGHMEM allocation will leave (224M+784M)/256 of ram reserved in ZONE_DMA
 *
 * TBD: should special case ZONE_DMA32 machines here - in those we normally
 * don't need any ZONE_NORMAL reservation
 */
static int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
======

So my thought is we either make the protection value two-dimensional,
with one dimension for __GFP_THISNODE allocations and a second for
FALLBACK allocations:
struct zone {
	......
	long lowmem_reserve[MAX_ZONELISTS][MAX_NR_ZONES];
};
Or we count a higher zone's pages across all nodes when calculating the
protection value for a lower zone, though the formula would need
adjusting, because the summed page count could be huge and the
protection value could end up bigger than the lower zone's page count.

Or we leave it until a real problem occurs and consider a fix then.

Any of these is fine with me.

> 
> P.S.
> I think we can have a more productive discussion without accusations.

Yes, we can, and I have never doubted that, no matter when or with whom.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-27  9:16               ` Vlastimil Babka
@ 2025-02-27 10:24                 ` Baoquan He
  2025-02-27 13:16                   ` Vlastimil Babka
  0 siblings, 1 reply; 19+ messages in thread
From: Baoquan He @ 2025-02-27 10:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Gabriel Krisman Bertazi, akpm, linux-mm, Mel Gorman

On 02/27/25 at 10:16am, Vlastimil Babka wrote:
> On 2/26/25 16:57, Baoquan He wrote:
> > On 02/26/25 at 01:01pm, Michal Hocko wrote:
> >> 
> >> Why do you think anything needs to be adjusted?
> > 
> > No, I don't think that. I am just wondering what led you to that
> > conclusion.
> > 
> >> 
> >> > I haven't thought of the whole zone fallback list to interleave nodes
> >> > which involves a lot of change.
> >> > 
> >> > > 
> >> > > Btw. has 96a5c186efff tried to fix any actual runtime problem? The
> >> > > changelog doesn't say much about that. 
> >> > 
> >> > No, no actual problem was observed on that.
> >> 
> >> OK
> >> 
> >> > I was just trying to clarify
> >> > the semantics, because I was confused by the obscure values printed
> >> > for zone->lowmem_reserve[] in /proc/zoneinfo.
> >> > 
> >> > I think we can merge this revert first, then investigate how to
> >> > clarify it better.
> >> 
> >> What do you think needs to be clarified? How exactly is the original
> >> output confusing?
> > 
> > When I did the change, I wrote the reason in the commit log; from what
> > you say, I don't think you read it. Let me explain again: in
> > setup_per_zone_lowmem_reserve(), each zone's protection value is
> > calculated based only on its own node's zones. E.g. below on node 0, the
> > Movable and Device zones are empty but still show influence on the
> > Normal/DMA32/DMA zones, which is unreasonable given how the protection
> > values are calculated and displayed.

Ah, I saw your mail just as I finished my reply to Michal. Thanks for
sharing such thorough details; I agree with almost all of them.

> 
> It's not unreasonable. A GFP_HIGHUSER_MOVABLE can use up to the Movable
> zone, so e.g. the dma32 zone should be protected from such an allocation, so
> it has space for GFP_DMA32 restricted allocations.

Yes, I didn't realize that when I did the change in commit 96a5c186efff,
sorry about that.

> 
> If no Movable zone exists, but Normal zone does, the result is the
> protection will be the same for GFP_KERNEL allocations (that can use up to
> the Normal zone) and GFP_HIGHUSER_MOVABLE allocations. (i.e. the number of
> 22134 in your listing is the same for both indexes). That's fine. But
> setting the protection from Movable allocations to 0 as commit 96a5c186efff
> did was simply a bug, as that can directly lead to GFP_HIGHUSER_MOVABLE
> depleting ZONE_DMA32.

Yes, agreed. I think that's the reason Gabriel observed the regression.

> 
> The only "unreasonable" part here is that we define and show protections
> from ZONE_DEVICE allocations. The usage of this zone is AFAIK completely
> separate from normal page allocation through zonelists, so we could exclude
> it, if anyone cared enough.

I think this is not the only unreasonable part.
sysctl_lowmem_reserve_ratio is a knob provided for users to tune memory
management, yet the underlying code can't meet the expectations the
configured ratio sets. Even though reverting my patch seems to work
well, the protection values are not well managed; it just happens to
work, because the protection value is calculated for __GFP_THISNODE
allocations while ignoring FALLBACK allocations. I think you pointed
that out well in the comment below.

> 
> > If, as your colleague Gabriel said, the protection value of the DMA zone
> > on node 0 really impacts allocations when the targeted zone is the
> > Movable zone, we may need to consider calculating the protection values
> > across nodes, because the impact happens among different nodes. I only
> > said we could investigate; I didn't say we need to change or have to
> > change anything.
> 
> There might be a theoretical issue if e.g. Node 0 only contained DMA and
> DMA32 zones and nothing else, while the Normal zone is on Node 1, there
> would be no protection for DMA/DMA32 zones from Normal allocations, as
> setup_per_zone_lowmem_reserve() considers each node separately and thus
> would not take Normal zone size from Node 1 into account.
> 
> Should we sum zone sizes across all nodes then? But then __GFP_THISNODE
> Normal allocations for node 0 would never succeed? Or we'd need a separate
> lowmem_reserve array for those?

Yeah, I have the same thought here. We may need an adaptation.

> 
> I guess the issue doesn't happen in practice. In any case it's out of scope
> of the reverted commit and the revert.

It could happen on arm64, because arm64 only has ZONE_DMA by default and
its boundary is not fixed. I have seen arm64 systems where all zones are
ZONE_DMA, so I guess it would be easier to find an arm64 system which
only has ZONE_DMA on node 0 and ZONE_NORMAL/ZONE_MOVABLE on other nodes.

> 
> > Node 0, zone      DMA
> >   ......
> >   pages free     2816
> >         ......
> >         protection: (0, 1582, 23716, 23716, 23716)
> > Node 0, zone    DMA32
> >   pages free     403269
> >         ......
> >         protection: (0, 0, 22134, 22134, 22134)
> > Node 0, zone   Normal
> >   pages free     5423879
> >         ......
> >         protection: (0, 0, 0, 0, 0)
> > Node 0, zone  Movable
> >   pages free     0
> >         ......
> >         protection: (0, 0, 0, 0, 0)
> > Node 0, zone   Device
> >   pages free     0
> >         ......
> >         protection: (0, 0, 0, 0, 0)
> > 
> >> 
> >> -- 
> >> Michal Hocko
> >> SUSE Labs
> >> 
> > 
> > 
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-26  3:22 [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" Gabriel Krisman Bertazi
  2025-02-26  6:54 ` Michal Hocko
  2025-02-26 13:00 ` Vlastimil Babka
@ 2025-02-27 11:50 ` Mel Gorman
  2 siblings, 0 replies; 19+ messages in thread
From: Mel Gorman @ 2025-02-27 11:50 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: akpm, linux-mm, Michal Hocko, Vlastimil Babka, Baoquan He

On Tue, Feb 25, 2025 at 10:22:58PM -0500, Gabriel Krisman Bertazi wrote:
> Commit 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's
> ->lowmem_reserve[] for empty zone") removes the protection of lower
> zones from allocations targeting memory-less high zones.  This had an
> unintended impact on the pattern of reclaims because it makes the
> high-zone-targeted allocation more likely to succeed in lower zones,
> which adds pressure to said zones.  I.e, the following corresponding
> checks in zone_watermark_ok/zone_watermark_fast are less likely to
> trigger:
> 
>         if (free_pages <= min + z->lowmem_reserve[highest_zoneidx])
>                 return false;
> 
> As a result, we are observing an increase in reclaim and kswapd scans,
> due to the increased pressure.  This was initially observed as increased
> latency in filesystem operations when benchmarking with fio on a machine
> with some memory-less zones, but it has since been associated with
> increased contention in locks related to memory reclaim.  By reverting
> this patch, the original performance was recovered on that machine.
> 
> The original commit was introduced as a clarification of the
> /proc/zoneinfo output, so it doesn't seem there are usecases depending
> on it, making the revert a simple solution.
> 
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Baoquan He <bhe@redhat.com>
> Fixes: 96a5c186efff ("mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone")
> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-27 10:24                 ` Baoquan He
@ 2025-02-27 13:16                   ` Vlastimil Babka
  2025-02-27 15:53                     ` Baoquan He
  0 siblings, 1 reply; 19+ messages in thread
From: Vlastimil Babka @ 2025-02-27 13:16 UTC (permalink / raw)
  To: Baoquan He
  Cc: Michal Hocko, Gabriel Krisman Bertazi, akpm, linux-mm,
	Mel Gorman, Petr Tesarik

On 2/27/25 11:24, Baoquan He wrote:
>> I guess the issue doesn't happen in practice. In any case it's out of scope
>> of the reverted commit and the revert.
> It could happen on arm64 because arm64 only has ZONE_DMA by default and
> its boundary is not fixed. I saw all zones are ZONE_DMA on arm64, I
> guess it could be easier to see a arm64 system which only has ZONE_DMA
> on node 0 and ZONE_NORMAL/MOVABLE on other nodes.

Does that mean ZONE_DMA is rather large on arm64, then? In that case
things probably work fine even if no protection is applied to it. The
x86 ones are small and thus need(ed) it much more. So I don't think we
need to change anything proactively unless there are known issues
observed in practice.

Another reason to avoid the effort is that hopefully we'll get rid of
the DMA zones anyway? They don't work all that well in modern times.
Cc'ing Petr for awareness (due to his recent LPC talk about this topic).


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone"
  2025-02-27 13:16                   ` Vlastimil Babka
@ 2025-02-27 15:53                     ` Baoquan He
  0 siblings, 0 replies; 19+ messages in thread
From: Baoquan He @ 2025-02-27 15:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, Gabriel Krisman Bertazi, akpm, linux-mm,
	Mel Gorman, Petr Tesarik

On 02/27/25 at 02:16pm, Vlastimil Babka wrote:
> On 2/27/25 11:24, Baoquan He wrote:
> >> I guess the issue doesn't happen in practice. In any case it's out of scope
> >> of the reverted commit and the revert.
> > It could happen on arm64 because arm64 only has ZONE_DMA by default and
> > its boundary is not fixed. I saw all zones are ZONE_DMA on arm64, I
> > guess it could be easier to see a arm64 system which only has ZONE_DMA
> > on node 0 and ZONE_NORMAL/MOVABLE on other nodes.
> 
> Does that mean ZONE_DMA is rather large on arm64, then? In that case

Hmm, I mean it's more likely on arm64 that there's only ZONE_DMA on
node 0, because its node/zone layout can be customized more flexibly in
firmware by the system vendor, unlike x86_64, which has fixed ranges for
ZONE_DMA and ZONE_DMA32 and always has ZONE_NORMAL on node 0.

> things probably work fine even if no protection is applied to it. The x86
> ones are small and thus need(ed) it much more. So I don't think we
> proactively need to change anything unless there are known issues observed
> in practice.

I am fine if we decide to leave it, since we have made the root cause
and all the potential issues clear. Once a known issue is reported, we
can make the change at that time.

> 
> Another reason to avoid the effort is that hopefully we'll get rid of the
> DMA zones anyway? They don't work all that well anyway in modern times.
> Ccing Petr for awareness (due to his recent LPC talk about this topic)

Thanks for letting me know. I noticed Petr's interesting presentation at
LPC 2024; it sounds ambitious but very attractive if it's finally
achieved. I love it, and I think it's another good reason not to touch
the protection values for now.



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2025-02-27 15:53 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-26  3:22 [PATCH] Revert "mm/page_alloc.c: don't show protection in zone's ->lowmem_reserve[] for empty zone" Gabriel Krisman Bertazi
2025-02-26  6:54 ` Michal Hocko
2025-02-26 10:00   ` Baoquan He
2025-02-26 10:52     ` Michal Hocko
2025-02-26 11:00       ` Michal Hocko
2025-02-26 11:51         ` Baoquan He
2025-02-26 12:01           ` Michal Hocko
2025-02-26 15:57             ` Baoquan He
2025-02-26 17:46               ` Michal Hocko
2025-02-27  9:41                 ` Baoquan He
2025-02-27  9:16               ` Vlastimil Babka
2025-02-27 10:24                 ` Baoquan He
2025-02-27 13:16                   ` Vlastimil Babka
2025-02-27 15:53                     ` Baoquan He
2025-02-26 13:07   ` Vlastimil Babka
2025-02-26 16:05   ` Gabriel Krisman Bertazi
2025-02-26 23:00     ` Andrew Morton
2025-02-26 13:00 ` Vlastimil Babka
2025-02-27 11:50 ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox