From: Akinobu Mita
Date: Wed, 4 Feb 2026 11:07:03 +0900
Subject: Re: [PATCH v4 3/3] mm/vmscan: don't demote if there is not enough free memory in the lower memory tier
To: Michal Hocko
Cc: Joshua Hahn, linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, akpm@linux-foundation.org, axelrasmussen@google.com,
    yuanchu@google.com, weixugc@google.com, hannes@cmpxchg.org,
    david@kernel.org, zhengqi.arch@bytedance.com, shakeel.butt@linux.dev,
    lorenzo.stoakes@oracle.com, Liam.Howlett@oracle.com, vbabka@suse.cz,
    rppt@kernel.org, surenb@google.com, ziy@nvidia.com,
    matthew.brost@intel.com, rakie.kim@sk.com, byungchul@sk.com,
    gourry@gourry.net, ying.huang@linux.alibaba.com, apopple@nvidia.com,
    bingjiao@google.com, jonathan.cameron@huawei.com,
    pratyush.brahma@oss.qualcomm.com
References: <20260127220003.3993576-1-joshua.hahnjy@gmail.com>

On Mon, Feb 2, 2026 at 22:11, Michal Hocko wrote:
>
> On Thu 29-01-26 09:40:17, Akinobu Mita wrote:
> > On Wed, Jan 28, 2026 at 7:00, Joshua Hahn wrote:
> > >
> > > > > > Therefore, it appears that the behavior of get_swappiness() is important
> > > > > > in this issue.
> > > > >
> > > > > This is quite mysterious.
> > > > >
> > > > > Especially because get_swappiness() is an MGLRU exclusive function, I find
> > > > > it quite strange that the issue you mention above occurs regardless of whether
> > > > > MGLRU is enabled or disabled. With MGLRU disabled, did you see the same hangs
> > > > > as before? Were these hangs similarly fixed by modifying the callsite in
> > > > > get_swappiness?
> > > >
> > > > Good point.
> > > > When MGLRU is disabled, changing only the behavior of can_demote()
> > > > called by get_swappiness() did not solve the problem.
> > > >
> > > > Instead, the problem was avoided by changing only the behavior of
> > > > can_demote() called by can_reclaim_anon_page(), without changing the
> > > > behavior of can_demote() called from other places.
> > > >
> > > > > On a separate note, I feel a bit uncomfortable for making this the default
> > > > > setting, regardless of whether there is swap space or not. Just as it is
> > > > > easy to create a degenerate scenario where all memory is unreclaimable
> > > > > and the system starts going into (wasteful) reclaim on the lower tiers,
> > > > > it is equally easy to create a scenario where all memory is very easily
> > > > > reclaimable (say, clean pagecache) and we OOM without making any attempt to
> > > > > free up memory on the lower tiers.
> > > > >
> > > > > Reality is likely somewhere in between. And from my perspective, as long as
> > > > > we have some amount of easily reclaimable memory, I don't think immediately
> > > > > OOMing will be helpful for the system (and even if none of the memory is
> > > > > easily reclaimable, we should still try doing something before killing).
> > > > >
> > > > > > > > The reason for this issue is that memory allocations do not directly
> > > > > > > > trigger the oom-killer, assuming that if the target node has an underlying
> > > > > > > > memory tier, it can always be reclaimed by demotion.
> > > > >
> > > > > This patch enforces that the opposite of this assumption is true; that even
> > > > > if a target node has an underlying memory tier, it can never be reclaimed by
> > > > > demotion.
> > > > >
> > > > > Certainly for systems with swap and some compression methods (z{ram, swap}),
> > > > > this new enforcement could be harmful to the system. What do you think?
> > > >
> > > > Thank you for the detailed explanation.
> > > >
> > > > I understand the concern regarding the current patch, which only
> > > > checks the free memory of the demotion target node.
> > > > I will explore a solution.
> > >
> > > Hello Akinobu, I hope you had a great weekend!
> > >
> > > I noticed something that I thought was worth flagging. It seems like the
> > > primary addition of this patch, which is to check for zone_watermark_ok
> > > across the zones, is already a part of should_reclaim_retry():
> > >
> > >         /*
> > >          * Keep reclaiming pages while there is a chance this will lead
> > >          * somewhere.  If none of the target zones can satisfy our allocation
> > >          * request even if all reclaimable pages are considered then we are
> > >          * screwed and have to go OOM.
> > >          */
> > >         for_each_zone_zonelist_nodemask(zone, z, ac->zonelist,
> > >                                 ac->highest_zoneidx, ac->nodemask) {
> > >
> > >                 [...snip...]
> > >
> > >                 /*
> > >                  * Would the allocation succeed if we reclaimed all
> > >                  * reclaimable pages?
> > >                  */
> > >                 wmark = __zone_watermark_ok(zone, order, min_wmark,
> > >                                 ac->highest_zoneidx, alloc_flags, available);
> > >
> > >                 if (wmark) {
> > >                         ret = true;
> > >                         break;
> > >                 }
> > >         }
> > >
> > > ... which is called in __alloc_pages_slowpath. I wonder why we don't already
> > > hit this.
> > > It seems to do the same thing your patch is doing?
> >
> > I checked the number of calls and the time spent for several functions
> > called by __alloc_pages_slowpath(), and found that time is spent in
> > __alloc_pages_direct_reclaim() before reaching the first should_reclaim_retry().
> >
> > After a few minutes have passed and the debug code that automatically
> > resets numa_demotion_enabled to false is executed, it appears that
> > __alloc_pages_direct_reclaim() immediately exits.
>
> First of all is this MGLRU or traditional reclaim? Or both?

The behavior is almost the same whether MGLRU is enabled or not.
However, one difference is that __alloc_pages_direct_reclaim() may be called
multiple times when __alloc_pages_slowpath() is called, and
should_reclaim_retry() also returns true several times.

This is probably because the watermark check in should_reclaim_retry()
considers not only NR_FREE_PAGES but also NR_ZONE_INACTIVE_ANON and
NR_ZONE_ACTIVE_ANON as potential free memory. (zone_reclaimable_pages())

The following is the increment of stats in /proc/vmstat from the start of
the reproduction test until the problem occurred and numa_demotion_enabled
was automatically reset by the debug code and OOM occurred a few minutes later:

workingset_nodes               578
workingset_refault_anon        5054381
workingset_refault_file        41502
workingset_activate_anon       3003283
workingset_activate_file       33232
workingset_restore_anon        2556549
workingset_restore_file        27139
workingset_nodereclaim         3472
pgdemote_kswapd                121684
pgdemote_direct                23977
pgdemote_khugepaged            0
pgdemote_proactive             0
pgsteal_kswapd                 3480404
pgsteal_direct                 2602011
pgsteal_khugepaged             74
pgsteal_proactive              0
pgscan_kswapd                  93334262
pgscan_direct                  227649302
pgscan_khugepaged              1232161
pgscan_proactive               0
pgscan_direct_throttle         18
pgscan_anon                    320480379
pgscan_file                    1735346
pgsteal_anon                   5828270
pgsteal_file                   254219

> Then another thing I've noticed only now.
> There seems to be a layering
> discrepancy (for traditional LRU reclaim) when get_scan_count which
> controls the to-be-reclaimed lrus always relies on can_reclaim_anon_pages
> while down the reclaim path shrink_folio_list tries to be more clever
> and avoid demotion if it turns out to be inefficient.
>
> I wouldn't be surprised if get_scan_count predominantly (or even
> exclusively) scanned anon LRUs only while increasing the reclaim
> priority (so essentially just checked all anon pages on the LRU list)
> before concluding that it makes no sense. This can take quite some time
> and in the worst case you could be recycling couple of page cache pages
> remaining on the list to make small but sufficient progress to loop
> around.
>
> So I think the first step is to make the demotion behavior consistent.
> If demotion fails then it would probably makes sense to set sc->no_demotion
> so that get_scan_count can learn from the reclaim feedback that
> anonymous pages are not a good reclaim target in this situation. But the
> whole reclaim path needs a careful review I am afraid.

If migrate_pages() in demote_folio_list() detects that it cannot migrate
any folios and all calls to alloc_demote_folio() also fail (this is made
possible by adding a few fields to migration_target_control), it sets
sc->no_demotion to true, which also resolves the issue.

	migrate_pages(demote_folios, alloc_demote_folio, NULL,
		      (unsigned long)&mtc, MIGRATE_ASYNC, MR_DEMOTION,
		      &nr_succeeded);

	if (!nr_succeeded && mtc.nr_alloc_tried > 0 &&
	    (mtc.nr_alloc_tried == mtc.nr_alloc_failed)) {
		sc->no_demotion = 1;
	}
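
For readers who want to experiment with the condition outside the kernel,
here is a minimal userspace sketch of the same check. It is only a model
under stated assumptions: struct demote_stats, record_alloc() and
demotion_is_futile() are hypothetical names invented for this illustration,
standing in for the extra migration_target_control fields
(nr_alloc_tried, nr_alloc_failed) mentioned above. It is not the actual
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace stand-in for the counters that the mail proposes
 * adding to struct migration_target_control.
 */
struct demote_stats {
	unsigned long nr_alloc_tried;	/* target-folio allocations attempted */
	unsigned long nr_alloc_failed;	/* ... of which this many failed */
};

/* Record the outcome of one target-folio allocation attempt. */
static void record_alloc(struct demote_stats *st, bool ok)
{
	assert(st != NULL);
	st->nr_alloc_tried++;
	if (!ok)
		st->nr_alloc_failed++;
}

/*
 * Mirror of the condition in the snippet above: demotion is treated as
 * futile only when nothing was migrated, at least one allocation was
 * attempted, and every attempted allocation failed.
 */
static bool demotion_is_futile(unsigned int nr_succeeded,
			       const struct demote_stats *st)
{
	assert(st != NULL);
	return nr_succeeded == 0 &&
	       st->nr_alloc_tried > 0 &&
	       st->nr_alloc_tried == st->nr_alloc_failed;
}
```

The nr_alloc_tried > 0 guard matters in this model: a pass that migrated
nothing because no allocation was ever attempted is not evidence that the
lower tier is full, so it should not turn demotion off.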