From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Fri, 24 May 2024 09:47:31 +0900
From: Byungchul Park <byungchul@sk.com>
To: Karim Manaouil
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kernel_team@skhynix.com
Subject: Re: [PATCH] mm: let kswapd work again for node that used to be
 hopeless but may not now
Message-ID: <20240524004731.GA78958@system.software.com>
References: <20240523051406.81053-1-byungchul@sk.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
User-Agent: Mutt/1.9.4 (2018-02-28)

On Thu, May 23, 2024 at 01:53:37PM +0100, Karim Manaouil wrote:
> On Thu, May 23, 2024 at 02:14:06PM +0900, Byungchul Park wrote:
> > I suffered from kswapd being stopped in the following scenario:
> >
> >    CONFIG_NUMA_BALANCING enabled
> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> >    numa node0 (500GB local DRAM, 128 CPUs)
> >    numa node1 (100GB CXL memory, no CPUs)
> >    swap off
> >
> >    1) Run any workload using a lot of anon pages e.g. mmap(200GB).
> >    2) Keep adding another workload using a lot of anon pages.
> >    3) The DRAM becomes filled with only anon pages through promotion.
> >    4) Demotion barely works due to severe memory pressure.
> >    5) kswapd for node0 stops because of the unreclaimable anon pages.
>
> It's not very clear to me, but if I understand correctly, if you have

I don't have free memory on CXL.

> free memory on CXL, kswapd0 should not stop as long as demotion is

kswapd0 stops because demotion barely works.
> successfully migrating the pages from DRAM to CXL and returns that as
> nr_reclaimed in shrink_folio_list()?
>
> If that's the case, kswapd0 is making progress and shouldn't give up.

It's not the case.

> If CXL memory is also filled and migration fails, then it doesn't make
> sense to me to wake up kswapd0 as it obviously won't help with anything,

That's true *only* while it won't help with anything. However, kswapd
should work again once the system gets back to normal, e.g. by
terminating the anon hoggers. That is the issue I addressed.

> because, you guessed it, you have no memory in the first place!!
>
> > 6) Manually kill the memory hoggers.

This is the point.

	Byungchul

> > 7) kswapd is still stopped even though the system got back to normal.
> >
> > From now on, the system has to run without the background reclaim
> > service served by kswapd, relying on direct reclaim alone. Even worse,
> > the tiering mechanism can no longer work because kswapd, which the
> > mechanism relies on, has stopped.
> >
> > However, after 6), the DRAM will be filled with pages that might or
> > might not be reclaimable, depending on how those are going to be used.
> > Since those are potentially reclaimable anyway, it's worth hopefully
> > trying reclaim by allowing kswapd to work again if needed.
> >
> > Signed-off-by: Byungchul Park <byungchul@sk.com>
> > ---
> >  include/linux/mmzone.h |  4 ++++
> >  mm/page_alloc.c        | 12 ++++++++++++
> >  mm/vmscan.c            | 21 ++++++++++++++++-----
> >  3 files changed, 32 insertions(+), 5 deletions(-)
> >
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index c11b7cde81ef..7c0ba90ea7b4 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -1331,6 +1331,10 @@ typedef struct pglist_data {
> >  	enum zone_type kswapd_highest_zoneidx;
> >
> >  	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> > +	int nr_may_reclaimable;		/* Number of pages that have been
> > +					   allocated since the node was
> > +					   considered hopeless due to too
> > +					   many kswapd_failures. */
> >
> >  #ifdef CONFIG_COMPACTION
> >  	int kcompactd_max_order;
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 14d39f34d336..1dd2daede014 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1538,8 +1538,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
> >  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
> >  							unsigned int alloc_flags)
> >  {
> > +	pg_data_t *pgdat = page_pgdat(page);
> > +
> >  	post_alloc_hook(page, order, gfp_flags);
> >
> > +	/*
> > +	 * New pages might or might not be reclaimable depending on how
> > +	 * these pages are going to be used. However, since these are
> > +	 * potentially reclaimable, it's worth hopefully trying reclaim
> > +	 * by allowing kswapd to work again even if there have been too
> > +	 * many ->kswapd_failures, if ->nr_may_reclaimable is big enough.
> > +	 */
> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +		pgdat->nr_may_reclaimable += 1 << order;
> > +
> >  	if (order && (gfp_flags & __GFP_COMP))
> >  		prep_compound_page(page, order);
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 3ef654addd44..5b39090c4ef1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -4943,6 +4943,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
> >  done:
> >  	/* kswapd should never fail */
> >  	pgdat->kswapd_failures = 0;
> > +	pgdat->nr_may_reclaimable = 0;
> >  }
> >
> >  /******************************************************************************
> > @@ -5991,9 +5992,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> >  	 * sleep. On reclaim progress, reset the failure counter. A
> >  	 * successful direct reclaim run will revive a dormant kswapd.
> >  	 */
> > -	if (reclaimable)
> > +	if (reclaimable) {
> >  		pgdat->kswapd_failures = 0;
> > -	else if (sc->cache_trim_mode)
> > +		pgdat->nr_may_reclaimable = 0;
> > +	} else if (sc->cache_trim_mode)
> >  		sc->cache_trim_mode_failed = 1;
> >  }
> >
> > @@ -6636,6 +6638,11 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
> >  	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
> >  }
> >
> > +static bool may_recaimable(pg_data_t *pgdat, int order)
> > +{
> > +	return pgdat->nr_may_reclaimable >= 1 << order;
> > +}
> > +
> >  /*
> >   * Prepare kswapd for sleeping. This verifies that there are no processes
> >   * waiting in throttle_direct_reclaim() and that watermarks have been met.
> > @@ -6662,7 +6669,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
> >  	wake_up_all(&pgdat->pfmemalloc_wait);
> >
> >  	/* Hopeless node, leave it to direct reclaim */
> > -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> > +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +	    !may_recaimable(pgdat, order))
> >  		return true;
> >
> >  	if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
> > @@ -6940,8 +6948,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
> >  			goto restart;
> >  	}
> >
> > -	if (!sc.nr_reclaimed)
> > +	if (!sc.nr_reclaimed) {
> >  		pgdat->kswapd_failures++;
> > +		pgdat->nr_may_reclaimable = 0;
> > +	}
> >
> >  out:
> >  	clear_reclaim_active(pgdat, highest_zoneidx);
> > @@ -7204,7 +7214,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
> >  		return;
> >
> >  	/* Hopeless node, leave it to direct reclaim if possible */
> > -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
> > +	if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> > +	     !may_recaimable(pgdat, order)) ||
> >  	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
> >  	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
> >  		/*
> > --
> > 2.17.1
> >
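
By the way, for anyone who wants to reproduce the scenario above, a
minimal user-space sketch of the anon memory hogger in steps 1) and 2)
could look like the following. This is only an illustration, not part
of the patch; any workload that touches a large anonymous mapping would
do, and the program name and default size are just examples.

/* anon_hog.c: map and touch a large anonymous region, then sit on it
 * until killed (step 6).  Size in GiB is taken from argv[1].
 */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t gib = argc > 1 ? strtoul(argv[1], NULL, 0) : 200;
	size_t len = gib << 30;
	long page = sysconf(_SC_PAGESIZE);
	char *p;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch every page so the anon pages are actually allocated. */
	for (size_t off = 0; off < len; off += page)
		p[off] = 1;

	/* Keep the memory resident until the process is killed. */
	pause();
	return 0;
}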