From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 23 May 2024 13:53:37 +0100
From: Karim Manaouil
To: Byungchul Park
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, kernel_team@skhynix.com
Subject: Re: [PATCH] mm: let kswapd work again for node that used to be hopeless but may not now
References: <20240523051406.81053-1-byungchul@sk.com>
In-Reply-To: <20240523051406.81053-1-byungchul@sk.com>

On Thu, May 23, 2024 at 02:14:06PM +0900, Byungchul Park wrote:
> I suffered from kswapd stopping in the following scenario:
>
>    CONFIG_NUMA_BALANCING enabled
>    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
>    numa node0 (500GB local DRAM, 128 CPUs)
>    numa node1 (100GB CXL memory, no CPUs)
>    swap off
>
>    1) Run any workload using a lot of anon pages, e.g. mmap(200GB).
>    2) Keep adding more workloads using a lot of anon pages.
>    3) The DRAM becomes filled with only anon pages through promotion.
>    4) Demotion barely works due to severe memory pressure.
>    5) kswapd for node0 stops because of the unreclaimable anon pages.

It's not entirely clear to me, but if I understand correctly: as long as
there is free memory on CXL, kswapd0 should not stop, because demotion
keeps migrating pages from DRAM to CXL and those migrations are returned
as nr_reclaimed from shrink_folio_list(). In that case kswapd0 is making
progress and shouldn't give up. If CXL memory is also full and migration
fails, then it doesn't make sense to me to wake kswapd0 up, as it
obviously won't help with anything, because, you guessed it, there is no
free memory left in the first place!

> 6) Manually kill the memory hoggers.
> 7) kswapd is still stopped even though the system got back to normal.
>
> From then on, the system has to run without the background reclaim
> service provided by kswapd, leaving everything to direct reclaim. Even
> worse, the tiering mechanism can no longer work, because it relies on
> kswapd, which has stopped.
>
> However, after 6), the DRAM will be filled with pages that might or
> might not be reclaimable, depending on how they are going to be used.
> Since they are potentially reclaimable anyway, it's worth hopefully
> trying reclaim by allowing kswapd to work again when needed.
> 
> Signed-off-by: Byungchul Park
> ---
>  include/linux/mmzone.h |  4 ++++
>  mm/page_alloc.c        | 12 ++++++++++++
>  mm/vmscan.c            | 21 ++++++++++++++++-----
>  3 files changed, 32 insertions(+), 5 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index c11b7cde81ef..7c0ba90ea7b4 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1331,6 +1331,10 @@ typedef struct pglist_data {
>  	enum zone_type kswapd_highest_zoneidx;
>  
>  	int kswapd_failures;		/* Number of 'reclaimed == 0' runs */
> +	int nr_may_reclaimable;		/* Number of pages that have been
> +					   allocated since considered the
> +					   node is hopeless due to too many
> +					   kswapd_failures. */
>  
>  #ifdef CONFIG_COMPACTION
>  	int kcompactd_max_order;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 14d39f34d336..1dd2daede014 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1538,8 +1538,20 @@ inline void post_alloc_hook(struct page *page, unsigned int order,
>  static void prep_new_page(struct page *page, unsigned int order, gfp_t gfp_flags,
>  						unsigned int alloc_flags)
>  {
> +	pg_data_t *pgdat = page_pgdat(page);
> +
>  	post_alloc_hook(page, order, gfp_flags);
>  
> +	/*
> +	 * New pages might or might not be reclaimable depending on how
> +	 * these pages are going to be used. However, since these are
> +	 * potentially reclaimable, it's worth hopefully trying reclaim
> +	 * by allowing kswapd to work again even if there have been too
> +	 * many ->kswapd_failures, if ->nr_may_reclaimable is big enough.
> +	 */
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +		pgdat->nr_may_reclaimable += 1 << order;
> +
>  	if (order && (gfp_flags & __GFP_COMP))
>  		prep_compound_page(page, order);
>  
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 3ef654addd44..5b39090c4ef1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -4943,6 +4943,7 @@ static void lru_gen_shrink_node(struct pglist_data *pgdat, struct scan_control *
>  done:
>  	/* kswapd should never fail */
>  	pgdat->kswapd_failures = 0;
> +	pgdat->nr_may_reclaimable = 0;
>  }
>  
>  /******************************************************************************
> @@ -5991,9 +5992,10 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>  	 * sleep. On reclaim progress, reset the failure counter. A
>  	 * successful direct reclaim run will revive a dormant kswapd.
>  	 */
> -	if (reclaimable)
> +	if (reclaimable) {
>  		pgdat->kswapd_failures = 0;
> -	else if (sc->cache_trim_mode)
> +		pgdat->nr_may_reclaimable = 0;
> +	} else if (sc->cache_trim_mode)
>  		sc->cache_trim_mode_failed = 1;
>  }
>  
> @@ -6636,6 +6638,11 @@ static void clear_pgdat_congested(pg_data_t *pgdat)
>  	clear_bit(PGDAT_WRITEBACK, &pgdat->flags);
>  }
>  
> +static bool may_recaimable(pg_data_t *pgdat, int order)
> +{
> +	return pgdat->nr_may_reclaimable >= 1 << order;
> +}
> +
>  /*
>   * Prepare kswapd for sleeping. This verifies that there are no processes
>   * waiting in throttle_direct_reclaim() and that watermarks have been met.
> @@ -6662,7 +6669,8 @@ static bool prepare_kswapd_sleep(pg_data_t *pgdat, int order,
>  		wake_up_all(&pgdat->pfmemalloc_wait);
>  
>  	/* Hopeless node, leave it to direct reclaim */
> -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES)
> +	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> +	    !may_recaimable(pgdat, order))
>  		return true;
>  
>  	if (pgdat_balanced(pgdat, order, highest_zoneidx)) {
> @@ -6940,8 +6948,10 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
>  		goto restart;
>  	}
>  
> -	if (!sc.nr_reclaimed)
> +	if (!sc.nr_reclaimed) {
>  		pgdat->kswapd_failures++;
> +		pgdat->nr_may_reclaimable = 0;
> +	}
>  
>  out:
>  	clear_reclaim_active(pgdat, highest_zoneidx);
> @@ -7204,7 +7214,8 @@ void wakeup_kswapd(struct zone *zone, gfp_t gfp_flags, int order,
>  		return;
>  
>  	/* Hopeless node, leave it to direct reclaim if possible */
> -	if (pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES ||
> +	if ((pgdat->kswapd_failures >= MAX_RECLAIM_RETRIES &&
> +	     !may_recaimable(pgdat, order)) ||
>  	    (pgdat_balanced(pgdat, order, highest_zoneidx) &&
>  	     !pgdat_watermark_boosted(pgdat, highest_zoneidx))) {
>  		/*
> -- 
> 2.17.1