Date: Tue, 4 Jun 2024 08:29:27 -0400
From: Johannes Weiner
To: Byungchul Park
Cc: "Huang, Ying", akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, kernel_team@skhynix.com, iamjoonsoo.kim@lge.com,
    rientjes@google.com
Subject: Re: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
Message-ID: <20240604122927.GA1992@cmpxchg.org>
References: <20240604072323.10886-1-byungchul@sk.com>
    <87bk4hcf7h.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <20240604084533.GA68919@system.software.com>
    <8734ptccgi.fsf@yhuang6-desk2.ccr.corp.intel.com>
    <20240604091221.GA28034@system.software.com>
    <20240604102516.GB28034@system.software.com>
In-Reply-To: <20240604102516.GB28034@system.software.com>

On Tue, Jun 04, 2024 at 07:25:16PM +0900, Byungchul Park
wrote:
> On Tue, Jun 04, 2024 at 06:12:22PM +0900, Byungchul Park wrote:
> > On Tue, Jun 04, 2024 at 04:57:17PM +0800, Huang, Ying wrote:
> > > Byungchul Park writes:
> > >
> > > > On Tue, Jun 04, 2024 at 03:57:54PM +0800, Huang, Ying wrote:
> > > >> Byungchul Park writes:
> > > >>
> > > >> > Changes from v1:
> > > >> >    1. Don't allow to resume kswapd if the system is under memory
> > > >> >       pressure that might affect direct reclaim by any chance, like
> > > >> >       if NR_FREE_PAGES is less than (low wmark + min wmark)/2.
> > > >> >
> > > >> > --->8---
> > > >> > From 6c73fc16b75907f5da9e6b33aff86bf7d7c9dd64 Mon Sep 17 00:00:00 2001
> > > >> > From: Byungchul Park
> > > >> > Date: Tue, 4 Jun 2024 15:27:56 +0900
> > > >> > Subject: [PATCH v2] mm: let kswapd work again for node that used to be hopeless but may not now
> > > >> >
> > > >> > A system should run with kswapd running in background when under memory
> > > >> > pressure, such as when the available memory level is below the low water
> > > >> > mark and there are reclaimable folios.
> > > >> >
> > > >> > However, the current code let the system run with kswapd stopped if
> > > >> > kswapd has been stopped due to more than MAX_RECLAIM_RETRIES failures
> > > >> > until direct reclaim will do for that, even if there are reclaimable
> > > >> > folios that can be reclaimed by kswapd.  This case was observed in the
> > > >> > following scenario:
> > > >> >
> > > >> >    CONFIG_NUMA_BALANCING enabled
> > > >> >    sysctl_numa_balancing_mode set to NUMA_BALANCING_MEMORY_TIERING
> > > >> >    numa node0 (500GB local DRAM, 128 CPUs)
> > > >> >    numa node1 (100GB CXL memory, no CPUs)
> > > >> >    swap off
> > > >> >
> > > >> >    1) Run a workload with big anon pages e.g. mmap(200GB).
> > > >> >    2) Continue adding the same workload to the system.
> > > >> >    3) The anon pages are placed in node0 by promotion/demotion.
> > > >> >    4) kswapd0 stops because of the unreclaimable anon pages in node0.
> > > >> >    5) Kill the memory hoggers to restore the system.
> > > >> >
> > > >> > After restoring the system at 5), the system starts to run without
> > > >> > kswapd.  Even worse, tiering mechanism is no longer able to work since
> > > >> > the mechanism relies on kswapd for demotion.
> > > >>
> > > >> We have run into the situation that kswapd is kept in failure state for
> > > >> long in a multiple tiers system.  I think that your solution is too
> > > >
> > > > My solution just gives a chance for kswapd to work again even if
> > > > kswapd_failures >= MAX_RECLAIM_RETRIES, if there are potential
> > > > reclaimable folios.  That's it.
> > > >
> > > >> limited, because OOM killing may not happen, while the access pattern of
> > > >
> > > > I don't get this.  OOM will happen as is, through direct reclaim.
> > >
> > > A system that fails to reclaim via kswapd may succeed to reclaim via
> > > direct reclaim, because more CPUs are used to scanning the page tables.
> > >
> > > In a system with NUMA balancing based page promotion and page demotion
> > > enabled, page promotion will wake up kswapd, but kswapd may fail in some
> > > situations.  But page promotion will not trigger direct reclaim or OOM.
> > >
> > > >> the workloads may change.  We have a preliminary and simple solution for
> > > >> this as follows,
> > > >>
> > > >> https://git.kernel.org/pub/scm/linux/kernel/git/vishal/tiering.git/commit/?h=tiering-0.8&id=17a24a354e12d4d4675d78481b358f668d5a6866
> > > >
> > > > Whether tiering is involved or not, the same problem can arise if
> > > > kswapd gets stopped due to kswapd_failures >= MAX_RECLAIM_RETRIES.
> > >
> > > Your description is about tiering too.  Can you describe a situation
> >
> > I mentioned "tiering" while I described how to reproduce because I ran
> > into the situation while testing with tiering system but I don't think
> > it's the necessary condition.
> >
> > Let me ask you back, why the logic to stop kswapd was considered in the
> > first place?  That's because the problem was already observed anyway
>
> To be clear..
>
> The problem, kswapd_failures >= MAX_RECLAIM_RETRIES, can happen whether
> tiering is involved or not.  Once kswapd stops, the system should run
> without kswapd even after recovered e.g. by killing the hoggers.  *Even
> worse*, tiering mechanism doesn't work in this situation.

But like Ying said, in other situations it's direct reclaim that kicks
in and clears the flag.

The failure-sleep and direct reclaim triggered recovery have been in
place since 2017. Both parties who observed an issue with it recently
did so in tiering scenarios. IMO a tiering-specific solution makes the
most sense.
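
For readers following the thread, the failure counter and its reset by
direct reclaim can be modelled roughly as below. This is a toy
user-space sketch in Python, not kernel code: Node, reclaim(),
kswapd_run(), wakeup_kswapd() and direct_reclaim() are hypothetical
stand-ins, the batch size of 32 pages is arbitrary, and the only things
taken from the discussion are the kswapd_failures counter, the
MAX_RECLAIM_RETRIES threshold (assumed here to be 16, as in the kernel's
mm/internal.h) and the rule that reclaim progress clears the counter.

```python
# Toy model of the "hopeless node" logic discussed in this thread.
# All names except kswapd_failures and MAX_RECLAIM_RETRIES are
# hypothetical and exist only for illustration.

MAX_RECLAIM_RETRIES = 16  # assumed to match the kernel's definition


class Node:
    def __init__(self, reclaimable_pages):
        self.reclaimable_pages = reclaimable_pages
        self.kswapd_failures = 0

    def hopeless(self):
        # kswapd is left asleep once this becomes true.
        return self.kswapd_failures >= MAX_RECLAIM_RETRIES


def reclaim(node, nr):
    """Reclaim up to nr pages; any progress clears the failure counter."""
    got = min(nr, node.reclaimable_pages)
    node.reclaimable_pages -= got
    if got:
        node.kswapd_failures = 0
    return got


def kswapd_run(node, nr=32):
    """One background pass; a pass without progress counts as a failure."""
    got = reclaim(node, nr)
    if not got:
        node.kswapd_failures += 1
    return got


def wakeup_kswapd(node):
    """Wake kswapd unless the node is considered hopeless."""
    if node.hopeless():
        return False
    kswapd_run(node)
    return True


def direct_reclaim(node, nr=32):
    """Reclaim done by the allocating task itself; shares reclaim()."""
    return reclaim(node, nr)


# Scenario: nothing is reclaimable, so kswapd gives up after 16 passes.
node = Node(reclaimable_pages=0)
for _ in range(MAX_RECLAIM_RETRIES):
    wakeup_kswapd(node)

# Reclaimable pages show up later, but wakeups are now ignored ...
node.reclaimable_pages = 100
woke_while_hopeless = wakeup_kswapd(node)

# ... until a direct reclaimer makes progress and clears the counter.
direct_reclaim(node)
woke_after_direct_reclaim = wakeup_kswapd(node)
```

In this model a wakeup source that never falls back to direct reclaim,
such as the promotion path Ying describes, has nothing that would reset
the counter, which is the tiering-specific gap being debated.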