Date: Mon, 22 Jul 2024 17:07:44 +0300
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
To: Michal Hocko
Cc: Andrew Morton, "Borislav Petkov (AMD)", Mel Gorman, Vlastimil Babka,
	Tom Lendacky, Mike Rapoport, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Jianxiong Gao, stable@vger.kernel.org
Subject: Re: [PATCH] mm: Fix endless reclaim on machines with unaccepted memory.
References: <20240716130013.1997325-1-kirill.shutemov@linux.intel.com>

On Wed, Jul 17, 2024 at 02:06:46PM +0200, Michal Hocko wrote:
> Please try to investigate this further. The patch as is looks rather
> questionable to me TBH. Spilling unaccepted memory into the reclaim
> seems like something we should avoid if possible as this is something

Okay, I believe I have a better understanding of the situation:

- __alloc_pages_bulk() takes pages from the free list without accepting
  more memory. This can cause the number of free pages to fall below the
  watermark. The issue can be resolved by accepting more memory in
  __alloc_pages_bulk() if the watermark check fails; see the sketch
  after this list.

  The problem is not limited to unaccepted memory: I think the deferred
  page initialization mechanism could likewise result in a premature OOM
  if __alloc_pages_bulk() allocates pages faster than deferred page
  initialization can add them to the free lists. However, this scenario
  is unlikely.

- Nothing compels the kernel to accept more memory after the watermarks
  have been calculated in __setup_per_zone_wmarks(), which can leave us
  under the watermark. This can be resolved by accepting memory up to
  the watermark once the watermarks have been initialized.

- Once kswapd is started, it will spin if we are below the watermark and
  there is no memory left to reclaim. Fixing the problems above resolves
  this as well.

- The kernel needs to accept memory up to the PROMO watermark, so that
  unaccepted memory does not interfere with NUMA balancing.
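To make the first point concrete, with the patch below applied the
zone-selection loop in __alloc_pages_bulk() would look roughly like this
(simplified sketch only; the cpuset and preferred-node checks at the top
of the loop are elided, and the deferred-init fallback is shown in the
patch itself):

	for_each_zone_zonelist_nodemask(zone, z, ac.zonelist,
					ac.highest_zoneidx, ac.nodemask) {
		unsigned long mark;

		/* cpuset and preferred-node checks elided */

		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
		if (zone_watermark_fast(zone, 0, mark,
					zonelist_zone_idx(ac.preferred_zoneref),
					alloc_flags, gfp))
			break;	/* enough free pages, use this zone */

		/*
		 * Watermark check failed: accepting more memory may get
		 * this zone back above the mark.
		 */
		if (has_unaccepted_memory() && try_to_accept_memory(zone, 0))
			break;
	}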
The patch below addresses the issues listed above. It is not yet ready
to be applied; please see the remaining concerns below.

Andrew, please drop the current patch.

There are a few more things I am worried about:

- The current get_page_from_freelist() and the patched
  __alloc_pages_bulk() only try to accept memory if the requested
  (alloc_flags & ALLOC_WMARK_MASK) watermark check fails. For example,
  an allocation requested with ALLOC_WMARK_MIN will not try to accept
  more memory, which can leave us under the high/promo watermark and
  send the next kswapd start into an endless loop. Do we want to make
  memory acceptance in these paths independent of alloc_flags?

- __isolate_free_page() removes a page from the free list without
  accepting new memory. The function is called with the zone lock held.
  Accepting memory while holding the zone lock is a bad idea, but the
  alternative of pushing the accept out to the caller is not much
  better. I have not observed any issues caused by __isolate_free_page()
  in practice, but there is no reason it couldn't cause problems.

- take_pages_off_buddy() also removes pages from the free list without
  accepting new memory. Unlike __isolate_free_page(), it is called
  without the zone lock held, so we can accept memory there. I believe
  we should; a sketch follows below.

I understand why adding unaccepted-memory handling to the reclaim path
is questionable. However, it may be the best way to handle cases like
__isolate_free_page(), and possibly others in the future, that take
memory directly off the free lists.

Any thoughts? I am still new to the reclaim code and may be overlooking
something significant. Please correct any misconceptions you see.
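As for take_pages_off_buddy(), the change could be as small as the
following (hypothetical sketch, not part of the patch below; it assumes
the accept helpers from mm/page_alloc.c are made reachable from that
code):

	/*
	 * Hypothetical: once the pages are off the free lists and the
	 * zone lock is not held, top the zone back up from unaccepted
	 * memory.
	 */
	if (has_unaccepted_memory())
		try_to_accept_memory(zone, 0);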
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index c11b7cde81ef..5e0bdfbe2f1f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -667,6 +667,7 @@ enum zone_watermarks {
 #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)
 #define low_wmark_pages(z) (z->_watermark[WMARK_LOW] + z->watermark_boost)
 #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
+#define promo_wmark_pages(z) (z->_watermark[WMARK_PROMO] + z->watermark_boost)
 #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
 
 /*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c62805dbd608..d537c633c6e9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1748,7 +1748,7 @@ static bool pgdat_free_space_enough(struct pglist_data *pgdat)
 			continue;
 
 		if (zone_watermark_ok(zone, 0,
-				      wmark_pages(zone, WMARK_PROMO) + enough_wmark,
+				      promo_wmark_pages(zone) + enough_wmark,
 				      ZONE_MOVABLE, 0))
 			return true;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 14d39f34d336..b744743d14a2 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4462,6 +4462,22 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 				alloc_flags, gfp)) {
 			break;
 		}
+
+		if (has_unaccepted_memory()) {
+			if (try_to_accept_memory(zone, 0))
+				break;
+		}
+
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+		/*
+		 * Watermark failed for this zone, but see if we can
+		 * grow this zone if it contains deferred pages.
+		 */
+		if (deferred_pages_enabled()) {
+			if (_deferred_grow_zone(zone, 0))
+				break;
+		}
+#endif
 	}
 
 	/*
@@ -5899,6 +5915,9 @@ static void __setup_per_zone_wmarks(void)
 		zone->_watermark[WMARK_PROMO] = high_wmark_pages(zone) + tmp;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
+
+		if (managed_zone(zone))
+			try_to_accept_memory(zone, 0);
 	}
 
 	/* update totalreserve_pages */
@@ -6866,8 +6885,8 @@ static bool try_to_accept_memory(struct zone *zone, unsigned int order)
 	long to_accept;
 	int ret = false;
 
-	/* How much to accept to get to high watermark? */
-	to_accept = high_wmark_pages(zone) -
+	/* How much to accept to get to promo watermark? */
+	to_accept = wmark_pages(zone, WMARK_PROMO) -
 		    (zone_page_state(zone, NR_FREE_PAGES) -
 		     __zone_watermark_unusable_free(zone, order, 0));
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ef654addd44..d20242e36904 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -6607,7 +6607,7 @@ static bool pgdat_balanced(pg_data_t *pgdat, int order, int highest_zoneidx)
 			continue;
 
 		if (sysctl_numa_balancing_mode & NUMA_BALANCING_MEMORY_TIERING)
-			mark = wmark_pages(zone, WMARK_PROMO);
+			mark = promo_wmark_pages(zone);
 		else
 			mark = high_wmark_pages(zone);
 		if (zone_watermark_ok_safe(zone, order, mark, highest_zoneidx))
-- 
  Kiryl Shutsemau / Kirill A. Shutemov