From mboxrd@z Thu Jan  1 00:00:00 1970
From: Seiji Nishikawa <snishika@redhat.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org,
	linux-kernel@vger.kernel.org,
	mgorman@techsingularity.net,
	snishika@redhat.com
Subject: [PATCH] mm: vmscan: account for free pages to prevent infinite Loop in throttle_direct_reclaim()
Date: Sun,  1 Dec 2024 01:12:33 +0900
Message-ID: <20241130161236.433747-1-snishika@redhat.com>
In-Reply-To: <20241129043936.316481-1-snishika@redhat.com>
References: <20241129043936.316481-1-snishika@redhat.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

On Fri, Nov 29, 2024 at 1:39 PM Seiji Nishikawa <snishika@redhat.com> wrote:
>
> On Thu, Nov 28, 2024 at 9:49 AM Andrew Morton <akpm@linux-foundation.org> wrote:
> >
> > On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa <snishika@redhat.com> wrote:
> >
> > > Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> > > zone_page_state_snapshot"), a task may remain indefinitely stuck in
> > > throttle_direct_reclaim() while holding mm->rwsem.
> > >
> > > __alloc_pages_nodemask
> > >  try_to_free_pages
> > >   throttle_direct_reclaim
> > >
> > > This can cause numerous other tasks to wait on the same rwsem, leading
> > > to severe system hangups:
> > >
> > > [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> > > [1088963.365653]       Tainted: G           OE     -------- -  - 4.18.0-553.el8_10.aarch64 #1
> > > [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [1088963.381862] task:python3         state:D stack:0     pid:1670971 ppid:1667117 flags:0x00800080
> > > [1088963.381869] Call trace:
> > > [1088963.381872]  __switch_to+0xd0/0x120
> > > [1088963.381877]  __schedule+0x340/0xac8
> > > [1088963.381881]  schedule+0x68/0x118
> > > [1088963.381886]  rwsem_down_read_slowpath+0x2d4/0x4b8
> > >
> > > The issue arises when allow_direct_reclaim(pgdat) returns false,
> > > preventing progress even when the pgdat->pfmemalloc_wait wait queue is
> > > empty. Despite the wait queue being empty, the condition,
> > > allow_direct_reclaim(pgdat), may still be returning false, causing it to
> > > continue looping.
> > >
> > > In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> > >  > 0), but calculations of pfmemalloc_reserve and free_pages result in
> > > wmark_ok being false.
> > >
> > > And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> > > is not woken up, further exacerbating the problem:
> > >
> > > crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> > > $775 = __MAX_NR_ZONES
> > >
> > > This patch modifies allow_direct_reclaim() to wake kswapd if the
> > > pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> > > true or false. This change ensures kswapd does not miss wake-ups under
> > > high memory pressure, reducing the risk of task stalls in the throttled
> > > reclaim path.
> >
> > The code which is being altered is over 10 years old.
> >
> > Is this misbehavior more recent?  If so, are we able to identify which
> > commit caused this?
>
> The issue is not new but may have become more noticeable after commit
> 501b26510ae3, which improved precision in allow_direct_reclaim(). This
> change exposed edge cases where wmark_ok is false despite reclaimable
> pages being available.
>
> > Otherwise, can you suggest why it took so long for this to be
> > discovered?  Your test case must be doing something unusual?
>
> The issue likely occurs under specific conditions: high memory pressure
> with frequent direct reclaim, contention on mmap_sem from concurrent
> memory allocations, reclaimable pages exist, but zone states cause
> wmark_ok to return false.
>
> Modern workloads (e.g., Python multiprocessing) and changes in kernel
> reclaim logic may have surfaced such edge cases more prominently than
> before.
>
> The workload involves concurrent Python processes under high memory
> pressure, leading to contention on mmap_sem. While not unusual, this
> workload may trigger a rare combination of conditions that expose the
> issue.
>
> >
> > > --- a/mm/vmscan.c
> > > +++ b/mm/vmscan.c
> > > @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
> > >
> > >       wmark_ok = free_pages > pfmemalloc_reserve / 2;
> > >
> > > -     /* kswapd must be awake if processes are being throttled */
> > > -     if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > > +     /* Always wake up kswapd if the wait queue is not empty */
> > > +     if (waitqueue_active(&pgdat->kswapd_wait)) {
> > >               if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
> > >                       WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
> > >
>

Further debugging showed that my earlier interpretation was incorrect: 
kswapd is in fact woken up whenever 
(!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) holds true. 

Every time kswapd() runs, it resets pgdat->kswapd_highest_zoneidx to 
MAX_NR_ZONES, which is why the field read __MAX_NR_ZONES at the moment 
the dump was captured.

The task continues looping in throttle_direct_reclaim() because 
allow_direct_reclaim(pgdat) keeps returning false. 

 #0 [ffff80002cb6f8d0] __switch_to at ffff8000080095ac
 #1 [ffff80002cb6f900] __schedule at ffff800008abbd1c
 #2 [ffff80002cb6f990] schedule at ffff800008abc50c
 #3 [ffff80002cb6f9b0] throttle_direct_reclaim at ffff800008273550
 #4 [ffff80002cb6fa20] try_to_free_pages at ffff800008277b68
 #5 [ffff80002cb6fae0] __alloc_pages_nodemask at ffff8000082c4660
 #6 [ffff80002cb6fc50] alloc_pages_vma at ffff8000082e4a98
 #7 [ffff80002cb6fca0] do_anonymous_page at ffff80000829f5a8
 #8 [ffff80002cb6fce0] __handle_mm_fault at ffff8000082a5974
 #9 [ffff80002cb6fd90] handle_mm_fault at ffff8000082a5bd4

At this point, the pgdat contains the following two zones:

        NODE: 4  ZONE: 0  ADDR: ffff00817fffe540  NAME: "DMA32"
          SIZE: 20480  MIN/LOW/HIGH: 11/28/45
          VM_STAT:
                NR_FREE_PAGES: 359
        NR_ZONE_INACTIVE_ANON: 18813
          NR_ZONE_ACTIVE_ANON: 0
        NR_ZONE_INACTIVE_FILE: 50
          NR_ZONE_ACTIVE_FILE: 0
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

        NODE: 4  ZONE: 1  ADDR: ffff00817fffec00  NAME: "Normal"
          SIZE: 8454144  PRESENT: 98304  MIN/LOW/HIGH: 68/166/264
          VM_STAT:
                NR_FREE_PAGES: 146
        NR_ZONE_INACTIVE_ANON: 94668
          NR_ZONE_ACTIVE_ANON: 3
        NR_ZONE_INACTIVE_FILE: 735
          NR_ZONE_ACTIVE_FILE: 78
          NR_ZONE_UNEVICTABLE: 0
        NR_ZONE_WRITE_PENDING: 0
                     NR_MLOCK: 0
                    NR_BOUNCE: 0
                   NR_ZSPAGES: 0
            NR_FREE_CMA_PAGES: 0

In allow_direct_reclaim(), while processing ZONE_DMA32, the sum of 
inactive/active file-backed pages calculated in zone_reclaimable_pages()
based on the result of zone_page_state_snapshot() is zero. 

Additionally, since this system has no swap, inactive/active anonymous 
pages are not counted at all.

        crash> p nr_swap_pages
        nr_swap_pages = $1937 = {
          counter = 0
        }

As a result, ZONE_DMA32 is deemed unreclaimable and skipped, moving on 
to the processing of the next zone, ZONE_NORMAL, despite ZONE_DMA32 
having free pages significantly exceeding the high watermark.

The problem is that pgdat->kswapd_failures has not been incremented.

        crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_failures
        $1935 = 0x0

This is because the node is deemed balanced. The node balancing logic in 
balance_pgdat() evaluates all zones collectively. If one or more zones 
(e.g., ZONE_DMA32) have enough free pages to meet their watermarks, the 
entire node is deemed balanced. This causes balance_pgdat() to exit 
early, before incrementing kswapd_failures, because it considers the 
overall memory state acceptable even though some zones (like 
ZONE_NORMAL) remain under significant pressure.
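
A minimal userspace sketch of that "any zone is enough" check, under
simplifying assumptions: struct zone_wmark and model_pgdat_balanced() are
hypothetical names, and the model compares free pages against the high
watermark only, ignoring allocation order and lowmem reserves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone view: free pages and the high watermark. */
struct zone_wmark {
	unsigned long free_pages;
	unsigned long high_wmark;
};

/*
 * Simplified model: the node counts as balanced as soon as any one
 * zone has more free pages than its high watermark.
 */
static bool model_pgdat_balanced(const struct zone_wmark *zones,
				 size_t nr_zones)
{
	size_t i;

	assert(zones != NULL);

	for (i = 0; i < nr_zones; i++) {
		if (zones[i].free_pages > zones[i].high_wmark)
			return true;
	}

	return false;
}
```

Fed with the values from the dump (DMA32: 359 free, high 45; Normal: 146
free, high 264), the model reports the node as balanced because of DMA32
alone, while Normal on its own would not pass.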

The new patch ensures that zone_reclaimable_pages() includes free pages 
(NR_FREE_PAGES) in its calculation when no other reclaimable pages are 
available (e.g., file-backed or anonymous pages). This change prevents 
zones like ZONE_DMA32, which have sufficient free pages, from being 
mistakenly deemed unreclaimable. By doing so, the patch ensures proper 
node balancing, avoids masking pressure on other zones like ZONE_NORMAL,
and prevents infinite loops in throttle_direct_reclaim() caused by 
allow_direct_reclaim(pgdat) repeatedly returning false.
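
To illustrate the intended effect, here is a userspace sketch rather than
the patch itself. All names (struct zone_model, model_reclaimable(),
model_allow_direct_reclaim(), count_free) are hypothetical, and the
fallback "use the free page count when nothing else is reclaimable" is my
reading of the description above. The DMA32 values and the Normal min
watermark and file page count (735 + 78) come from the dump; the Normal
free page count of 20 in the usage note is an assumed value chosen to
show the throttling case, not a number from the dump.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone model used only for this illustration. */
struct zone_model {
	unsigned long min_wmark;
	unsigned long free_pages;
	unsigned long reclaimable;	/* file (+ anon if swap) pages */
};

/* With count_free set, fall back to free pages when nothing else counts. */
static unsigned long model_reclaimable(const struct zone_model *z,
				       bool count_free)
{
	unsigned long nr = z->reclaimable;

	if (nr == 0 && count_free)
		nr = z->free_pages;

	return nr;
}

/* Simplified model of the watermark check in allow_direct_reclaim(). */
static bool model_allow_direct_reclaim(const struct zone_model *zones,
				       size_t nr_zones, bool count_free)
{
	unsigned long pfmemalloc_reserve = 0;
	unsigned long free_pages = 0;
	size_t i;

	assert(zones != NULL);

	for (i = 0; i < nr_zones; i++) {
		/* Zones that look unreclaimable are skipped entirely. */
		if (!model_reclaimable(&zones[i], count_free))
			continue;

		pfmemalloc_reserve += zones[i].min_wmark;
		free_pages += zones[i].free_pages;
	}

	/* No reserves at all: do not throttle. */
	if (!pfmemalloc_reserve)
		return true;

	return free_pages > pfmemalloc_reserve / 2;
}
```

With zones { 11, 359, 0 } and { 68, 20, 813 }, the model returns false
when count_free is false (only Normal is considered: 20 is not above
68 / 2) and true when count_free is true (DMA32's 359 free pages are
included: 379 is above 79 / 2).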