From: Seiji Nishikawa <snishika@redhat.com>
To: akpm@linux-foundation.org
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, mgorman@techsingularity.net, snishika@redhat.com
Subject: Re: [PATCH] mm: vmscan: ensure kswapd is woken up if the wait queue is active
Date: Fri, 29 Nov 2024 13:39:35 +0900
Message-ID: <20241129043936.316481-1-snishika@redhat.com>
In-Reply-To: <20241127164948.74659f9400fd076760c2a670@linux-foundation.org>
References: <20241127164948.74659f9400fd076760c2a670@linux-foundation.org>

On Thu, Nov 28, 2024 at 9:49 AM Andrew Morton wrote:
>
> On Wed, 27 Nov 2024 00:06:12 +0900 Seiji Nishikawa wrote:
>
> > Even after commit 501b26510ae3 ("vmstat: allow_direct_reclaim should use
> > zone_page_state_snapshot"), a task may remain indefinitely stuck in
> > throttle_direct_reclaim() while holding mm->rwsem.
> >
> > __alloc_pages_nodemask
> >  try_to_free_pages
> >   throttle_direct_reclaim
> >
> > This can cause numerous other tasks to wait on the same rwsem, leading
> > to severe system hangups:
> >
> > [1088963.358712] INFO: task python3:1670971 blocked for more than 120 seconds.
> > [1088963.365653] Tainted: G OE -------- - - 4.18.0-553.el8_10.aarch64 #1
> > [1088963.373887] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [1088963.381862] task:python3 state:D stack:0 pid:1670971 ppid:1667117 flags:0x00800080
> > [1088963.381869] Call trace:
> > [1088963.381872] __switch_to+0xd0/0x120
> > [1088963.381877] __schedule+0x340/0xac8
> > [1088963.381881] schedule+0x68/0x118
> > [1088963.381886] rwsem_down_read_slowpath+0x2d4/0x4b8
> >
> > The issue arises when allow_direct_reclaim(pgdat) returns false,
> > preventing progress even when the pgdat->pfmemalloc_wait wait queue is
> > empty. Despite the wait queue being empty, the condition,
> > allow_direct_reclaim(pgdat), may still be returning false, causing it to
> > continue looping.
> >
> > In some cases, reclaimable pages exist (zone_reclaimable_pages() returns
> > > 0), but calculations of pfmemalloc_reserve and free_pages result in
> > wmark_ok being false.
> >
> > And then, despite the pgdat->kswapd_wait queue being non-empty, kswapd
> > is not woken up, further exacerbating the problem:
> >
> > crash> px ((struct pglist_data *) 0xffff00817fffe540)->kswapd_highest_zoneidx
> > $775 = __MAX_NR_ZONES
> >
> > This patch modifies allow_direct_reclaim() to wake kswapd if the
> > pgdat->kswapd_wait queue is active, regardless of whether wmark_ok is
> > true or false. This change ensures kswapd does not miss wake-ups under
> > high memory pressure, reducing the risk of task stalls in the throttled
> > reclaim path.
>
> The code which is being altered is over 10 years old.
>
> Is this misbehavior more recent?  If so, are we able to identify which
> commit caused this?

The issue is not new, but it may have become more noticeable after commit
501b26510ae3, which improved the precision of allow_direct_reclaim() by
switching it to zone_page_state_snapshot(). That change exposed edge cases
where wmark_ok is false even though reclaimable pages are available.

> Otherwise, can you suggest why it took so long for this to be
> discovered?  Your test case must be doing something unusual?

The issue likely requires a specific combination of conditions: high
memory pressure with frequent direct reclaim; contention on mmap_sem from
concurrent memory allocations; and zone states in which reclaimable pages
exist but wmark_ok is still false.

Modern workloads (e.g., Python multiprocessing) and changes in the kernel
reclaim logic may have surfaced such edge cases more prominently than
before.

The workload here consists of concurrent Python processes running under
high memory pressure, which leads to contention on mmap_sem. The workload
itself is not unusual, but it may trigger a rare combination of conditions
that exposes the issue.

>
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -6389,8 +6389,8 @@ static bool allow_direct_reclaim(pg_data_t *pgdat)
> >
> >  	wmark_ok = free_pages > pfmemalloc_reserve / 2;
> >
> > -	/* kswapd must be awake if processes are being throttled */
> > -	if (!wmark_ok && waitqueue_active(&pgdat->kswapd_wait)) {
> > +	/* Always wake up kswapd if the wait queue is not empty */
> > +	if (waitqueue_active(&pgdat->kswapd_wait)) {
> >  		if (READ_ONCE(pgdat->kswapd_highest_zoneidx) > ZONE_NORMAL)
> >  			WRITE_ONCE(pgdat->kswapd_highest_zoneidx, ZONE_NORMAL);
> >
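
To make the wmark_ok condition discussed above easier to reason about, here
is a rough user-space sketch of the check, not the kernel code itself. The
types struct zone_model and struct node_model, the function
allow_direct_reclaim_model(), and the "patched" flag are all hypothetical
names introduced for illustration. The zone loop and the early return when
there is no reserve are simplified from my reading of mm/vmscan.c and should
be treated as assumptions; only the wmark_ok expression and the two wake-up
conditions are taken from the diff quoted above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-zone values the check reads. */
struct zone_model {
	unsigned long reclaimable_pages; /* models zone_reclaimable_pages() */
	unsigned long min_wmark_pages;   /* models the zone's min watermark */
	unsigned long free_pages;        /* models the NR_FREE_PAGES snapshot */
};

/* Hypothetical stand-in for the node state touched by the check. */
struct node_model {
	bool kswapd_waiting;  /* models waitqueue_active(&pgdat->kswapd_wait) */
	unsigned int wakeups; /* counts modeled kswapd wake-ups */
};

/*
 * Model of the throttling check. "patched" selects the wake-up condition
 * from the proposed change; otherwise the existing condition is used.
 */
static bool allow_direct_reclaim_model(struct node_model *node,
				       const struct zone_model *zones,
				       size_t nr_zones, bool patched)
{
	unsigned long pfmemalloc_reserve = 0;
	unsigned long free_pages = 0;
	bool wmark_ok;
	size_t i;

	for (i = 0; i < nr_zones; i++) {
		/* Assumption: zones with nothing to reclaim are skipped. */
		if (!zones[i].reclaimable_pages)
			continue;
		pfmemalloc_reserve += zones[i].min_wmark_pages;
		free_pages += zones[i].free_pages;
	}

	/* Assumption: with no reserve at all, callers are not throttled. */
	if (!pfmemalloc_reserve)
		return true;

	/* Same expression as in the quoted diff. */
	wmark_ok = free_pages > pfmemalloc_reserve / 2;

	/* Existing code: wake only if !wmark_ok. Patched: wake regardless. */
	if ((patched || !wmark_ok) && node->kswapd_waiting) {
		node->kswapd_waiting = false;
		node->wakeups++;
	}

	return wmark_ok;
}
```

With one zone that has 100 reclaimable pages, a min watermark of 64 and 20
free pages, the model returns false (20 is not greater than 64 / 2), which
is the "reclaimable pages exist but wmark_ok is false" case. In this model
the two variants differ only when wmark_ok is true: the existing condition
leaves a waiting kswapd asleep, while the patched one wakes it.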