linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active
@ 2024-11-26 15:06 Seiji Nishikawa
  2024-11-28  0:49 ` Andrew Morton
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-26 15:06 UTC (permalink / raw)
  To: akpm, linux-mm; +Cc: linux-kernel, snishika

Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
zone_page_state_snapshot"), a task may remain indefinitely stuck in
throttle_direct_reclaim() while holding mm->rwsem.

__alloc_pages_nodemask
 try_to_free_pages
  throttle_direct_reclaim

This can cause numerous other tasks to wait on the same rwsem, leading
to severe system hangups:

[1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
[1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
[1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
[1088963.381869] Call trace:
[1088963.381872]  __switch_to+0xd0/0x120
[1088963.381877]  __schedule+0x340/0xac8
[1088963.381881]  schedule+0x68/0x118
[1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8

The issue arises when allow_direct_reclaim(pgdat) returns false,
preventing progress even when the pgdat->pfmemalloc_wait wait queue is
empty. Despite the wait queue being empty, the condition,
allow_direct_reclaim(pgdat), may still be returning false, causing it to
continue looping.

In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
 > 0), but calculations of pfmemalloc_reserve and free_pages result in
wmark_ok being false.

And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
is not woken up, further exacerbating the problem:

crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
$775 = __MAX_NR_ZONES

This patch modifies allow_direct_reclaim() to wake kswapd if the
pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
true or false. This change ensures kswapd does not miss wake-ups under
high memory pressure, reducing the risk of task stalls in the throttled
reclaim path.

Signed-off-by: Seiji Nishikawa <snishika@redhat.com>
---
 mm/vmscan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 76378bc257e3..b1b3e5a116a8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
 
 	wmark_ok = free_pages > pfmemalloc_reserve / 2;
 
-	/* kswapd must be awake if processes are being throttled */
-	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
+	/* Always wake up kswapd if the wait queue is not empty */
+	if (waitqueue_active(&pgdat->kswapd_wait)) {
 		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
 			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
 
-- 
2.47.0



^ permalink raw reply	[flat|nested] 9+ messages in thread
* [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
@ 2024-11-30 16:43 Seiji Nishikawa
  2024-11-30 16:43 ` Seiji Nishikawa
  0 siblings, 1 reply; 9+ messages in thread
From: Seiji Nishikawa @ 2024-11-30 16:43 UTC (permalink / raw)
  To: akpm, linux-mm; +Cc: linux-kernel, mgorman, snishika

The task sometimes continues looping in throttle_direct_reclaim() 
because allow_direct_reclaim(pgdat) keeps returning false. 

 #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
 #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
 #2 [ffff80002cb6f990] schedule at ffff800008abc50c
 #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4

At this point, the pgdat contains the following two zones:

        NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
          SIZE: 20480  MIN/LOW/HIGH: 11/28/45
          VM_STAT:
                NR_FREE_PAGES: 359
        NR_ZONE_INACTIVE_ANON: 18813
          NR_ZONE_ACTIVE_ANON: 0
        NR_ZONE_INACTIVE_FILE: 50
          NR_ZONE_ACTIVE_FILE: 0
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

        NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
          SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
          VM_STAT:
                NR_FREE_PAGES: 146
        NR_ZONE_INACTIVE_ANON: 94668
          NR_ZONE_ACTIVE_ANON: 3
        NR_ZONE_INACTIVE_FILE: 735
          NR_ZONE_ACTIVE_FILE: 78
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of 
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero. 

Additionally, since this system lacks swap, the calculation of inactive/
active anonymous pages is skipped.

        crash> p nr_swap_pages
        nr_swap_pages = $1937 = {
          counter = 0
        }

As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on 
to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 
having free pages significantly exceeding the high watermark.

The problem is that the pgdat->kswapd_failures hasn't been incremented.

        crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
        $1935 = 0x0

This is because the node deemed balanced. The node balancing logic in 
balance_pgdat() evaluates all zones collectively. If one or more zones 
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the 
entire node is deemed balanced. This causes balance_pgdat() to exit 
early before incrementing the kswapd_failures, as it considers the 
overall memory state acceptable, even though some zones (like 
ZONE_NORMAL) remain under significant pressure.

The new patch ensures that zone_reclaimable_pages() includes free pages 
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are 
available (e.g., file-backed or anonymous pages). This change prevents 
zones like ZONE_DMA32, which have sufficient free pages, from being 
mistakenly deemed unreclaimable. By doing so, the patch ensures proper 
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by 
allow_direct_reclaim(pgdat) repeatedly returning false.



^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2024-12-05 15:34 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-26 15:06 [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active Seiji Nishikawa
2024-11-28  0:49 ` Andrew Morton
2024-11-29  4:39   ` Seiji Nishikawa
2024-11-30 16:12     ` [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim() Seiji Nishikawa
2024-11-30 16:12       ` Seiji Nishikawa
2024-12-01  2:40         ` Andrew Morton
2024-12-01  4:11           ` Seiji Nishikawa
2024-11-30 16:43 Seiji Nishikawa
2024-11-30 16:43 ` Seiji Nishikawa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox