[PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active


* [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active
@ 2024-11-26 15:06 Seiji Nishikawa
  2024-11-28  0:49 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-26 15:06 UTC (permalink / raw)
  To: akpm, linux-mm; +Cc: linux-kernel, snishika

Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
zone_page_state_snapshot"), a task may remain stuck indefinitely in
throttle_direct_reclaim() while holding mm->mmap_sem.

__alloc_pages_nodemask
 try_to_free_pages
  throttle_direct_reclaim

This can cause numerous other tasks to wait on the same rwsem, leading
to severe system hangups:

[1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
[1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
[1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
[1088963.381869] Call trace:
[1088963.381872]  __switch_to+0xd0/0x120
[1088963.381877]  __schedule+0x340/0xac8
[1088963.381881]  schedule+0x68/0x118
[1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8

The issue arises when allow_direct_reclaim(pgdat) keeps returning false
even though the pgdat->pfmemalloc_wait wait queue is empty, so the task
in throttle_direct_reclaim() continues looping.

In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
 > 0), but calculations of pfmemalloc_reserve and free_pages result in
wmark_ok being false.

And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
is not woken up, further exacerbating the problem:

crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
$775 = __MAX_NR_ZONES

This patch modifies allow_direct_reclaim() to wake kswapd if the
pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
true or false. This change ensures kswapd does not miss wake-ups under
high memory pressure, reducing the risk of task stalls in the throttled
reclaim path.
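
The change to the wake-up condition can be modeled in userspace as two
predicates. This is only a sketch, not kernel code: struct node_model,
wake_before() and wake_after() are hypothetical names, and the two
booleans stand in for wmark_ok and
waitqueue_active(&pgdat->kswapd_wait).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two inputs of the wake-up condition. */
struct node_model {
	bool wmark_ok;		/* free_pages > pfmemalloc_reserve / 2 */
	bool kswapd_waiting;	/* waitqueue_active(&pgdat->kswapd_wait) */
};

/* Before the patch: wake kswapd only when the watermark check fails. */
bool wake_before(const struct node_model *node)
{
	assert(node);
	return !node->wmark_ok && node->kswapd_waiting;
}

/* After the patch: wake kswapd whenever it sits on the wait queue. */
bool wake_after(const struct node_model *node)
{
	assert(node);
	return node->kswapd_waiting;
}
```

The two predicates differ on exactly one input: wmark_ok true while
kswapd is waiting. That is the only case in which the patch adds a
wake-up.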

Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..b1b3e5a116a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)

 	wmark_ok = free_pages > pfmemalloc_reserve / 2;

-	/* kswapd must be awake if processes are being throttled */
-	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+	/* Always wake up kswapd if the wait queue is not empty */
+	if (waitqueue_active(&pgdat->kswapd_wait)) {
 		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
 			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);

-- 
2.47.0


* Re: [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active
  2024-11-26 15:06 [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active Seiji Nishikawa
@ 2024-11-28  0:49 ` Andrew Morton
  2024-11-29  4:39   ` Seiji Nishikawa
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2024-11-28  0:49 UTC (permalink / raw)
  To: Seiji Nishikawa; +Cc: linux-mm, linux-kernel, Mel Gorman

On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:

> Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> zone_page_state_snapshot"), a task may remain indefinitely stuck in
> throttle_direct_reclaim() while holding mm->rwsem.
> 
> __alloc_pages_nodemask
>  try_to_free_pages
>   throttle_direct_reclaim
> 
> This can cause numerous other tasks to wait on the same rwsem, leading
> to severe system hangups:
> 
> [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> [1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
> [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
> [1088963.381869] Call trace:
> [1088963.381872]  __switch_to+0xd0/0x120
> [1088963.381877]  __schedule+0x340/0xac8
> [1088963.381881]  schedule+0x68/0x118
> [1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8
> 
> The issue arises when allow_direct_reclaim(pgdat) returns false,
> preventing progress even when the pgdat->pfmemalloc_wait wait queue is
> empty. Despite the wait queue being empty, the condition,
> allow_direct_reclaim(pgdat), may still be returning false, causing it to
> continue looping.
> 
> In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
>  > 0), but calculations of pfmemalloc_reserve and free_pages result in
> wmark_ok being false.
> 
> And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> is not woken up, further exacerbating the problem:
> 
> crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> $775 = __MAX_NR_ZONES
> 
> This patch modifies allow_direct_reclaim() to wake kswapd if the
> pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> true or false. This change ensures kswapd does not miss wake-ups under
> high memory pressure, reducing the risk of task stalls in the throttled
> reclaim path.

The code which is being altered is over 10 years old.  

Is this misbehavior more recent?  If so, are we able to identify which
commit caused this?

Otherwise, can you suggest why it took so long for this to be
discovered?  Your test case must be doing something unusual?

Thanks.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
>  
>  	wmark_ok = free_pages > pfmemalloc_reserve / 2;
>  
> -	/* kswapd must be awake if processes are being throttled */
> -	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> +	/* Always wake up kswapd if the wait queue is not empty */
> +	if (waitqueue_active(&pgdat->kswapd_wait)) {
>  		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
>  			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
>  




* Re: [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active
  2024-11-28  0:49 ` Andrew Morton
@ 2024-11-29  4:39   ` Seiji Nishikawa
  2024-11-30 16:12     ` [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim() Seiji Nishikawa
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-29  4:39 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mgorman, snishika

On Thu, Nov 28, 2024 at 9:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:
>
> > Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> > zone_page_state_snapshot"), a task may remain indefinitely stuck in
> > throttle_direct_reclaim() while holding mm->rwsem.
> >
> > __alloc_pages_nodemask
> >  try_to_free_pages
> >   throttle_direct_reclaim
> >
> > This can cause numerous other tasks to wait on the same rwsem, leading
> > to severe system hangups:
> >
> > [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> > [1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
> > [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
> > [1088963.381869] Call trace:
> > [1088963.381872]  __switch_to+0xd0/0x120
> > [1088963.381877]  __schedule+0x340/0xac8
> > [1088963.381881]  schedule+0x68/0x118
> > [1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8
> >
> > The issue arises when allow_direct_reclaim(pgdat) returns false,
> > preventing progress even when the pgdat->pfmemalloc_wait wait queue is
> > empty. Despite the wait queue being empty, the condition,
> > allow_direct_reclaim(pgdat), may still be returning false, causing it to
> > continue looping.
> >
> > In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> >  > 0), but calculations of pfmemalloc_reserve and free_pages result in
> > wmark_ok being false.
> >
> > And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> > is not woken up, further exacerbating the problem:
> >
> > crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> > $775 = __MAX_NR_ZONES
> >
> > This patch modifies allow_direct_reclaim() to wake kswapd if the
> > pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> > true or false. This change ensures kswapd does not miss wake-ups under
> > high memory pressure, reducing the risk of task stalls in the throttled
> > reclaim path.
>
> The code which is being altered is over 10 years old.
>
> Is this misbehavior more recent?  If so, are we able to identify which
> commit caused this?

The issue is not new but may have become more noticeable after commit 
501b26510ae3, which improved precision in allow_direct_reclaim(). This 
change exposed edge cases where wmark_ok is false despite reclaimable 
pages being available.

> Otherwise, can you suggest why it took so long for this to be
> discovered?  Your test case must be doing something unusual?

The issue likely occurs only when several conditions coincide: high 
memory pressure with frequent direct reclaim, contention on mmap_sem 
from concurrent memory allocations, and zone states that leave wmark_ok 
false even though reclaimable pages exist.

Modern workloads (e.g., Python multiprocessing) and changes in kernel 
reclaim logic may have surfaced such edge cases more prominently than 
before.

The workload involves concurrent Python processes under high memory 
pressure, leading to contention on mmap_sem. While not unusual, this 
workload may trigger a rare combination of conditions that expose the 
issue.

>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
> >
> >       wmark_ok = free_pages > pfmemalloc_reserve / 2;
> >
> > -     /* kswapd must be awake if processes are being throttled */
> > -     if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > +     /* Always wake up kswapd if the wait queue is not empty */
> > +     if (waitqueue_active(&pgdat->kswapd_wait)) {
> >               if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
> >                       WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
> >




* [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
  2024-11-29  4:39   ` Seiji Nishikawa
@ 2024-11-30 16:12     ` Seiji Nishikawa
  2024-11-30 16:12       ` Seiji Nishikawa
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-30 16:12 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mgorman, snishika

On Fri, Nov 29, 2024 at 1:39 PM Seiji Nishikawa <snishika@redhat.com> wrote:
>
> On Thu, Nov 28, 2024 at 9:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:
> >
> > > Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> > > zone_page_state_snapshot"), a task may remain indefinitely stuck in
> > > throttle_direct_reclaim() while holding mm->rwsem.
> > >
> > > __alloc_pages_nodemask
> > >  try_to_free_pages
> > >   throttle_direct_reclaim
> > >
> > > This can cause numerous other tasks to wait on the same rwsem, leading
> > > to severe system hangups:
> > >
> > > [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> > > [1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
> > > [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
> > > [1088963.381869] Call trace:
> > > [1088963.381872]  __switch_to+0xd0/0x120
> > > [1088963.381877]  __schedule+0x340/0xac8
> > > [1088963.381881]  schedule+0x68/0x118
> > > [1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8
> > >
> > > The issue arises when allow_direct_reclaim(pgdat) returns false,
> > > preventing progress even when the pgdat->pfmemalloc_wait wait queue is
> > > empty. Despite the wait queue being empty, the condition,
> > > allow_direct_reclaim(pgdat), may still be returning false, causing it to
> > > continue looping.
> > >
> > > In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> > >  > 0), but calculations of pfmemalloc_reserve and free_pages result in
> > > wmark_ok being false.
> > >
> > > And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> > > is not woken up, further exacerbating the problem:
> > >
> > > crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> > > $775 = __MAX_NR_ZONES
> > >
> > > This patch modifies allow_direct_reclaim() to wake kswapd if the
> > > pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> > > true or false. This change ensures kswapd does not miss wake-ups under
> > > high memory pressure, reducing the risk of task stalls in the throttled
> > > reclaim path.
> >
> > The code which is being altered is over 10 years old.
> >
> > Is this misbehavior more recent?  If so, are we able to identify which
> > commit caused this?
>
> The issue is not new but may have become more noticeable after commit
> 501b26510ae3, which improved precision in allow_direct_reclaim(). This
> change exposed edge cases where wmark_ok is false despite reclaimable
> pages being available.
>
> > Otherwise, can you suggest why it took so long for this to be
> > discovered?  Your test case must be doing something unusual?
>
> The issue likely occurs under specific conditions: high memory pressure
> with frequent direct reclaim, contention on mmap_sem from concurrent
> memory allocations, reclaimable pages exist, but zone states cause
> wmark_ok to return false.
>
> Modern workloads (e.g., Python multiprocessing) and changes in kernel
> reclaim logic may have surfaced such edge cases more prominently than
> before.
>
> The workload involves concurrent Python processes under high memory
> pressure, leading to contention on mmap_sem. While not unusual, this
> workload may trigger a rare combination of conditions that expose the
> issue.
>
> >
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
> > >
> > >       wmark_ok = free_pages > pfmemalloc_reserve / 2;
> > >
> > > -     /* kswapd must be awake if processes are being throttled */
> > > -     if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > > +     /* Always wake up kswapd if the wait queue is not empty */
> > > +     if (waitqueue_active(&pgdat->kswapd_wait)) {
> > >               if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
> > >                       WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
> > >
>

Further debugging showed that my earlier interpretation was incorrect: 
kswapd is in fact woken up whenever 
(!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) holds.

Every time kswapd() runs, it overwrites pgdat->kswapd_highest_zoneidx 
with MAX_NR_ZONES, which is why the field read __MAX_NR_ZONES at the 
moment the dump was captured.
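
The handoff on this field can be modeled in userspace as follows. This
is a simplified sketch under stated assumptions: struct pgdat_model,
request_kswapd() and kswapd_take_request() are hypothetical names, the
zone enum is reduced to three zones, and the real kswapd() does more
than read and reset one field.

```c
#include <assert.h>

/* Simplified zone indices; the real enum depends on the kernel config. */
enum zone_idx { ZONE_DMA32, ZONE_NORMAL, ZONE_MOVABLE, MAX_NR_ZONES };

/* Hypothetical stand-in for the single pglist_data field discussed here. */
struct pgdat_model {
	enum zone_idx kswapd_highest_zoneidx;
};

/* Waker side, mirroring allow_direct_reclaim(): cap at ZONE_NORMAL. */
void request_kswapd(struct pgdat_model *pgdat)
{
	assert(pgdat);
	if (pgdat->kswapd_highest_zoneidx > ZONE_NORMAL)
		pgdat->kswapd_highest_zoneidx = ZONE_NORMAL;
}

/* kswapd side: consume the request, then reset the field to the sentinel. */
enum zone_idx kswapd_take_request(struct pgdat_model *pgdat)
{
	enum zone_idx requested;

	assert(pgdat);
	requested = pgdat->kswapd_highest_zoneidx;
	pgdat->kswapd_highest_zoneidx = MAX_NR_ZONES;
	return requested;
}
```

In this model a dump taken after kswapd_take_request() always shows
MAX_NR_ZONES, whether or not a wake-up request preceded it, so the
sentinel value alone says nothing about missed wake-ups.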

The task continues looping in throttle_direct_reclaim() because 
allow_direct_reclaim(pgdat) keeps returning false. 

 #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
 #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
 #2 [ffff80002cb6f990] schedule at ffff800008abc50c
 #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4

At this point, the pgdat contains the following two zones:

        NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
          SIZE: 20480  MIN/LOW/HIGH: 11/28/45
          VM_STAT:
                NR_FREE_PAGES: 359
        NR_ZONE_INACTIVE_ANON: 18813
          NR_ZONE_ACTIVE_ANON: 0
        NR_ZONE_INACTIVE_FILE: 50
          NR_ZONE_ACTIVE_FILE: 0
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

        NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
          SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
          VM_STAT:
                NR_FREE_PAGES: 146
        NR_ZONE_INACTIVE_ANON: 94668
          NR_ZONE_ACTIVE_ANON: 3
        NR_ZONE_INACTIVE_FILE: 735
          NR_ZONE_ACTIVE_FILE: 78
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of 
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero. 

Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.

        crash> p nr_swap_pages
        nr_swap_pages = $1937 = {
          counter = 0
        }

As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on 
to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 
having free pages significantly exceeding the high watermark.
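
A minimal userspace sketch of this accounting, assuming that the
decision to count anonymous pages reduces to a single boolean (here
can_reclaim_anon, false on a system without swap). struct zone_counts
and reclaimable_pages_model() are hypothetical names, not kernel
symbols.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the per-zone LRU counters. */
struct zone_counts {
	unsigned long inactive_file;
	unsigned long active_file;
	unsigned long inactive_anon;
	unsigned long active_anon;
};

/*
 * Model of the unpatched zone_reclaimable_pages(): file pages always
 * count, anonymous pages count only if they can be reclaimed.
 */
unsigned long reclaimable_pages_model(const struct zone_counts *zone,
				      bool can_reclaim_anon)
{
	unsigned long nr;

	assert(zone);
	nr = zone->inactive_file + zone->active_file;
	if (can_reclaim_anon)
		nr += zone->inactive_anon + zone->active_anon;
	return nr;
}
```

With the file counters at zero and can_reclaim_anon false, the model
returns 0 no matter how many anonymous or free pages the zone holds,
which is the case in which allow_direct_reclaim() skips the zone.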

The problem is that pgdat->kswapd_failures has not been incremented.

        crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
        $1935 = 0x0

This is because the node is deemed balanced. The node balancing logic in 
balance_pgdat() evaluates all zones collectively: if one or more zones 
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the 
entire node is deemed balanced. balance_pgdat() therefore exits early, 
before incrementing kswapd_failures, because it considers the overall 
memory state acceptable even though some zones (like ZONE_NORMAL) remain 
under significant pressure.

The new patch ensures that zone_reclaimable_pages() includes free pages 
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are 
available (e.g., file-backed or anonymous pages). This change prevents 
zones like ZONE_DMA32, which have sufficient free pages, from being 
mistakenly deemed unreclaimable. By doing so, the patch ensures proper 
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by 
allow_direct_reclaim(pgdat) repeatedly returning false.
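
The effect of skipping a zone on the throttling decision can be
sketched in userspace. This is a model under stated assumptions, not
the kernel function: struct zone_model and allow_direct_reclaim_model()
are hypothetical names, the wake-up of kswapd is left out, and the
per-zone reclaimable count is passed in rather than computed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_RECLAIM_RETRIES 16

/* Hypothetical per-zone inputs to the throttling decision. */
struct zone_model {
	unsigned long min_wmark;	/* min watermark, in pages */
	unsigned long free_pages;	/* NR_FREE_PAGES snapshot */
	unsigned long reclaimable;	/* zone_reclaimable_pages() result */
};

/* Returns true if the caller may proceed with direct reclaim. */
bool allow_direct_reclaim_model(const struct zone_model *zones,
				size_t nr_zones, int kswapd_failures)
{
	unsigned long reserve = 0, free_pages = 0;
	size_t i;

	assert(zones || nr_zones == 0);

	/* Fallback: once kswapd has given up, never throttle. */
	if (kswapd_failures >= MAX_RECLAIM_RETRIES)
		return true;

	for (i = 0; i < nr_zones; i++) {
		/* Zones with nothing reclaimable are left out entirely. */
		if (!zones[i].reclaimable)
			continue;
		reserve += zones[i].min_wmark;
		free_pages += zones[i].free_pages;
	}

	/* No reserves at all: do not throttle. */
	if (!reserve)
		return true;

	return free_pages > reserve / 2;
}
```

As a purely hypothetical input, take one zone with min 11, free 359 and
reclaimable 0, and a second zone with min 68, free 20 and reclaimable
813. Only the second zone is counted, so the check is 20 > 68 / 2 and
fails. If the first zone is counted as well, the check becomes
379 > 79 / 2 and passes. With kswapd_failures at 16 the model returns
true regardless of the zone data.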





* [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
  2024-11-30 16:12     ` [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim() Seiji Nishikawa
@ 2024-11-30 16:12       ` Seiji Nishikawa
  2024-12-01  2:40         ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-30 16:12 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mgorman, snishika

The kernel hangs due to a task stuck in throttle_direct_reclaim(),
caused by a node being incorrectly deemed balanced despite pressure in
certain zones, such as ZONE_NORMAL. This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.

The lack of swap or reclaimable pages results in ZONE_DMA32 being
ignored during reclaim, masking pressure in other zones. Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().

This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist. This ensures
zones with sufficient free pages are not skipped, enabling proper
balancing and reclaim behavior.
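
The fallback can be sketched in userspace as follows. This is a model
of the proposed behavior, not the kernel function: struct zone_counts
and reclaimable_pages_patched() are hypothetical names, and the
decision to count anonymous pages is reduced to one boolean.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the per-zone counters used by the model. */
struct zone_counts {
	unsigned long free_pages;
	unsigned long inactive_file;
	unsigned long active_file;
	unsigned long inactive_anon;
	unsigned long active_anon;
};

/* Model of zone_reclaimable_pages() with the proposed fallback. */
unsigned long reclaimable_pages_patched(const struct zone_counts *zone,
					bool can_reclaim_anon)
{
	unsigned long nr;

	assert(zone);
	nr = zone->inactive_file + zone->active_file;
	if (can_reclaim_anon)
		nr += zone->inactive_anon + zone->active_anon;

	/* Proposed change: fall back to free pages when nothing else counts. */
	if (nr == 0)
		nr = zone->free_pages;
	return nr;
}
```

Note that the fallback only applies when the LRU-based sum is exactly
zero. A zone with even one countable file page reports that count and
not its free pages.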

Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
---
 mm/vmscan.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..fb6b4056dcce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -374,7 +374,14 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
 	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
-
+	/*
+	 * If there are no reclaimable file-backed or anonymous pages,
+	 * ensure zones with sufficient free pages are not skipped.
+	 * This prevents zones like DMA32 from being ignored in reclaim
+	 * scenarios where they can still help alleviate memory pressure.
+	 */
+	if (nr == 0)
+		nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);
 	return nr;
 }

-- 
2.47.0


* Re: [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
  2024-11-30 16:12       ` Seiji Nishikawa
@ 2024-12-01  2:40         ` Andrew Morton
  2024-12-01  4:11           ` Seiji Nishikawa
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2024-12-01  2:40 UTC (permalink / raw)
  To: Seiji Nishikawa; +Cc: linux-mm, linux-kernel, mgorman

On Sun,  1 Dec 2024 01:12:34 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:

> The kernel hangs due to a task stuck in throttle_direct_reclaim(),
> caused by a node being incorrectly deemed balanced despite pressure in
> certain zones, such as ZONE_NORMAL. This issue arises from
> zone_reclaimable_pages() returning 0 for zones without reclaimable file-
> backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
> free pages to be skipped.
> 
> The lack of swap or reclaimable pages results in ZONE_DMA32 being
> ignored during reclaim, masking pressure in other zones. Consequently,
> pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
> mechanisms in allow_direct_reclaim() from being triggered, leading to an
> infinite loop in throttle_direct_reclaim().
> 
> This patch modifies zone_reclaimable_pages() to account for free pages
> (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures
> zones with sufficient free pages are not skipped, enabling proper
> balancing and reclaim behavior.

We'll want to backport a fix for this into -stable kernels.  For that
it's best to be able to identify a suitable Fixes: target, to tell
others whether their kernel needs the fix.  Are you able to help
identify that commit?

Thanks.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -374,7 +374,14 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
>  	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
>  		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
>  			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
> -
> +	/*
> +	 * If there are no reclaimable file-backed or anonymous pages, 
> +	 * ensure zones with sufficient free pages are not skipped. 
> +	 * This prevents zones like DMA32 from being ignored in reclaim 
> +	 * scenarios where they can still help alleviate memory pressure.
> +	 */
> +	if (nr == 0)
> +	    nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);
>  	return nr;
>  }
>  
> -- 
> 2.47.0



* Re: [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
  2024-12-01  2:40         ` Andrew Morton
@ 2024-12-01  4:11           ` Seiji Nishikawa
  0 siblings, 0 replies; 9+ messages in thread
From: Seiji Nishikawa @ 2024-12-01  4:11 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, mgorman, snishika

On Sun, Dec 1, 2024 at 11:40 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Sun,  1 Dec 2024 01:12:34 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:
>
> > The kernel hangs due to a task stuck in throttle_direct_reclaim(),
> > caused by a node being incorrectly deemed balanced despite pressure in
> > certain zones, such as ZONE_NORMAL. This issue arises from
> > zone_reclaimable_pages() returning 0 for zones without reclaimable file-
> > backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
> > free pages to be skipped.
> >
> > The lack of swap or reclaimable pages results in ZONE_DMA32 being
> > ignored during reclaim, masking pressure in other zones. Consequently,
> > pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
> > mechanisms in allow_direct_reclaim() from being triggered, leading to an
> > infinite loop in throttle_direct_reclaim().
> >
> > This patch modifies zone_reclaimable_pages() to account for free pages
> > (NR_FREE_PAGES) when no other reclaimable pages exist. This ensures
> > zones with sufficient free pages are not skipped, enabling proper
> > balancing and reclaim behavior.
>
> We'll want to backport a fix for this into -stable kernels.  For that
> it's best to be able to identify a suitable Fixes: target, to tell
> others whether their kernel needs the fix.  Are you able to help
> identify that commit?

Based on my analysis, the issue is rooted in the original design of 
zone_reclaimable_pages(). The later change a2a36488a61c ("mm/vmscan: 
Consider anonymous pages without swap") only refines the handling of 
anonymous pages and does not fundamentally alter this behavior: zones 
with sufficient free pages but no reclaimable file-backed or anonymous 
pages are still not accounted for. The commit that introduced the 
function is:

Fixes: 5a1c84b404a7 ("mm: remove reclaim and compaction retry approximations")

This commit seems to be the most appropriate target for the Fixes: tag,
as it introduced the logic that my patch modifies to address the 
observed kernel hang.

>
> Thanks.
>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -374,7 +374,14 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
> >       if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
> >               nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
> >                       zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
> > -
> > +     /*
> > +      * If there are no reclaimable file-backed or anonymous pages,
> > +      * ensure zones with sufficient free pages are not skipped.
> > +      * This prevents zones like DMA32 from being ignored in reclaim
> > +      * scenarios where they can still help alleviate memory pressure.
> > +      */
> > +     if (nr == 0)
> > +         nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);
> >       return nr;
> >  }
> >
> > --
> > 2.47.0
>




* [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
  2024-11-30 16:43 Seiji Nishikawa
@ 2024-11-30 16:43 ` Seiji Nishikawa
  0 siblings, 0 replies; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-30 16:43 UTC (permalink / raw)
  To: akpm, linux-mm; +Cc: linux-kernel, mgorman, snishika

The kernel hangs due to a task stuck in throttle_direct_reclaim(),
caused by a node being incorrectly deemed balanced despite pressure in
certain zones, such as ZONE_NORMAL. This issue arises from
zone_reclaimable_pages() returning 0 for zones without reclaimable file-
backed or anonymous pages, causing zones like ZONE_DMA32 with sufficient
free pages to be skipped.

The lack of swap or reclaimable pages results in ZONE_DMA32 being
ignored during reclaim, masking pressure in other zones. Consequently,
pgdat->kswapd_failures remains 0 in balance_pgdat(), preventing fallback
mechanisms in allow_direct_reclaim() from being triggered, leading to an
infinite loop in throttle_direct_reclaim().

This patch modifies zone_reclaimable_pages() to account for free pages
(NR_FREE_PAGES) when no other reclaimable pages exist. This ensures
zones with sufficient free pages are not skipped, enabling proper
balancing and reclaim behavior.

Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
---
 mm/vmscan.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..fb6b4056dcce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -374,7 +374,14 @@ unsigned long zone_reclaimable_pages(struct zone *zone)
 	if (can_reclaim_anon_pages(NULL, zone_to_nid(zone), NULL))
 		nr += zone_page_state_snapshot(zone, NR_ZONE_INACTIVE_ANON) +
 			zone_page_state_snapshot(zone, NR_ZONE_ACTIVE_ANON);
-
+	/*
+	 * If there are no reclaimable file-backed or anonymous pages,
+	 * ensure zones with sufficient free pages are not skipped.
+	 * This prevents zones like DMA32 from being ignored in reclaim
+	 * scenarios where they can still help alleviate memory pressure.
+	 */
+	if (nr == 0)
+		nr = zone_page_state_snapshot(zone, NR_FREE_PAGES);
 	return nr;
 }

-- 
2.47.0


* [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
@ 2024-11-30 16:43 Seiji Nishikawa
  2024-11-30 16:43 ` Seiji Nishikawa
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-30 16:43 UTC (permalink / raw)
  To: akpm, linux-mm; +Cc: linux-kernel, mgorman, snishika

The task sometimes continues looping in throttle_direct_reclaim() 
because allow_direct_reclaim(pgdat) keeps returning false. 

 #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
 #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
 #2 [ffff80002cb6f990] schedule at ffff800008abc50c
 #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4

At this point, the pgdat contains the following two zones:

        NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
          SIZE: 20480  MIN/LOW/HIGH: 11/28/45
          VM_STAT:
                NR_FREE_PAGES: 359
        NR_ZONE_INACTIVE_ANON: 18813
          NR_ZONE_ACTIVE_ANON: 0
        NR_ZONE_INACTIVE_FILE: 50
          NR_ZONE_ACTIVE_FILE: 0
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

        NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
          SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
          VM_STAT:
                NR_FREE_PAGES: 146
        NR_ZONE_INACTIVE_ANON: 94668
          NR_ZONE_ACTIVE_ANON: 3
        NR_ZONE_INACTIVE_FILE: 735
          NR_ZONE_ACTIVE_FILE: 78
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of 
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero. 

Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.

        crash> p nr_swap_pages
        nr_swap_pages = $1937 = {
          counter = 0
        }

As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on 
to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 
having free pages significantly exceeding the high watermark.

The problem is that pgdat->kswapd_failures has not been incremented.

        crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
        $1935 = 0x0

This is because the node is deemed balanced. The node balancing logic in 
balance_pgdat() evaluates all zones collectively: if one or more zones 
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the 
entire node is deemed balanced. balance_pgdat() therefore exits early, 
before incrementing kswapd_failures, because it considers the overall 
memory state acceptable even though some zones (like ZONE_NORMAL) remain 
under significant pressure.

The new patch ensures that zone_reclaimable_pages() includes free pages 
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are 
available (e.g., file-backed or anonymous pages). This change prevents 
zones like ZONE_DMA32, which have sufficient free pages, from being 
mistakenly deemed unreclaimable. By doing so, the patch ensures proper 
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by 
allow_direct_reclaim(pgdat) repeatedly returning false.

